Skip to content
master
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
R
 
 
 
 
man
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Hrate

Introduction

This package implements the increasing window entropy estimator (i.e., LZ78) as described in [1], [2] and [3] to estimate the entropy rate of a string of symbols. In the example given here, symbols are words, and the string is a pre-processed text of the German version of the European Parliament Corpus (EPC) [4]. The package provides two main functions: stabilize.estimate() and get.estimate(). The first function calculates the entropy rate for a pre-specified set of token counts. This helps to establish the number of tokens necessary for entropy rates to stabilize. The second function gives a single estimate for a pre-specified number of tokens.

Installation

Install either directly from GitHub

library(devtools)
install_github("dimalik/Hrate")

or from the tarball

install.packages("Hrate.tar.gz", repos=NULL, type="source")

Usage

library(Hrate)

## load the provided demo corpus: 50 K tokens of the German Europarl 
data(deuparl)
## punctuation and numbers are removed and word tokens are set to lower case
deuparl[1:10]

[1] "wiederaufnahme"  "der"             "sitzungsperiode" "ich"             "erkläre"
[6] "die"             "am"              "freitag"         "dem"             "dezember"

## get the stabilization points via stabilize.estimate()
## "step.size" is an argument specifying step sizes (in number of tokens) at which entropy rates are calculated. "max.length" specifies the maximum number of tokens to be included. "every.word=1" specifies that each word token should be used for estimation. To speed up processing only every 2nd, 3rd, xth word token could be used. Hence, every.word can be assigned any integer between 1 and the step size. "rate" gives the downsampling rate to get SDs over a given number of entropy rate estimations (see Section 4.2 in [3]). converge.estimate returns a S4 object. 
stabilization.rate <- stabilize.estimate(text = deuparl, step.size = 1000, max.length = 50000, every.word = 10, method="downsample",rate = 5)
                                     
## output the convergence rate and the associated SDs
print(get.stabilization(stabilization.rate))
print(get.criterion(stabilization.rate))

## we also expose a get.estimate function to get a single estimate on a text. "every.word""
## and "max.length" are the same arguments as for converge.estimate().
estimate <- get.estimate(text = deuparl, every.word = 10, max.length = 50000)

## plot the stabilization results
plot(convergence.rate)

References

  1. Gao, Y.; Kontoyiannis, I.; Bienenstock, E. Estimating the entropy of binary time series: Methodology, some theory and a simulation study. Entropy 2008, 10, 71–99.

  2. Bentz, C.; Alikaniotis D. The word entropy of natural languages, arXiv 2016.

  3. Bentz, C.; Alikaniotis D.; Cysouw, M.; Ferrer-i-Cancho, R. The entropy of words - learnability and expressivity across more than 1000 languages. Entropy 2017.

  4. Koehn, P. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, Phuket, Thailand, 12-16 September 2005; Volume 5, pp. 79–86.

Travis-CI Build Status

About

Hrate package for R

Topics

Resources

License

Releases

No releases published

Languages

You can’t perform that action at this time.