Normalized Frequency Difference Estimator
Supporting package for the Normalized Frequency Difference (NFD) metric, as described in Bentz et al. (2015).
If you already have devtools, you can install the package directly from GitHub:
Otherwise, you can clone the package (or download the .zip file) and then install it either via the R GUI (or RStudio) or by invoking:

    install.packages("path/to/zip", repos = NULL, type = "source")
Using the package
In order to get the NFD value between two discrete distributions you simply call:

    library(nfd)
    ## example distributions from pg. 6 of the paper
    freqA <- c(45, 20, 15, 10, 5, 1, 1, 1, 1, 1)
    freqB <- rep(10, 10)
    NFD(freqA, freqB)
    ## > 0.5

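To make the number above less of a black box, here is a minimal plain-R sketch of the computation. It assumes the measure is the sum of absolute differences between rank-aligned frequencies divided by the sum of all frequencies, which is consistent with the 0.5 reported for the example. The name `nfd_sketch` is hypothetical and is not part of the package, whose actual `NFD` is implemented in C.

```r
## Illustrative sketch only, not the package's implementation.
## Assumed formula: sum(|fA_i - fB_i|) / sum(fA_i + fB_i) over rank-aligned frequencies.
nfd_sketch <- function(freqA, freqB) {
  n <- max(length(freqA), length(freqB))
  # rank-order each distribution and pad the shorter one with zeros
  a <- c(sort(freqA, decreasing = TRUE), rep(0, n - length(freqA)))
  b <- c(sort(freqB, decreasing = TRUE), rep(0, n - length(freqB)))
  sum(abs(a - b)) / sum(a + b)
}

freqA <- c(45, 20, 15, 10, 5, 1, 1, 1, 1, 1)
freqB <- rep(10, 10)
nfd_sketch(freqA, freqB)
## > 0.5
```

Under this assumed formula the value is 0 for identical distributions and can never exceed 1.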
NFD actually returns an R object which implements its own generic functions (such as plot). You can, therefore, run:

    n <- NFD(freqA, freqB)
    summary(n) # a nicely output summary
    plot(n)    # a ggplot such as the ones in pg. 11

The package also includes, as a demo corpus, the English and Italian translations of the UDHR, on which you can either compute the NFD value or run a text size simulation.

    data(udhr.demo)
    nfd.score <- NFD(udhr.demo$english, udhr.demo$italian)

Effect of text size simulations
The nfd package also provides another class that helps you run simulations to find the effect of text size on the NFD (or any similar measure).

    text.size.sim <- TSsim(udhr.demo$english, udhr.demo$italian, max.size = 1000, random.sampling = TRUE)

- max.size controls the text size. It implicitly creates a sequence (i.e. 2:max.size) and, at each step, takes a chunk of that size from each corpus. This assumes that max.size < min(length(corpus1), length(corpus2)). If this does not hold, max.size is trimmed to be smaller than the smaller corpus. For larger corpora it is advisable to provide a custom sampling sequence (see below).
- random.sampling controls whether the samples taken from the corpora will be random or from the beginning.

For more details see Analysis 3 in the paper.
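The procedure described above can be sketched in plain R as follows. This is a toy re-implementation under stated assumptions, not the code behind `TSsim`: it assumes each corpus is a character vector of tokens, that the score at each step is computed on the frequency table of the chunk, and that the measure follows the ratio formula consistent with the 0.5 example earlier. The names `nfd_sketch` and `text_size_sketch` are hypothetical.

```r
## Toy sketch of a text size simulation; NOT the package's implementation.
nfd_sketch <- function(a, b) {
  n <- max(length(a), length(b))
  a <- c(sort(a, decreasing = TRUE), rep(0, n - length(a)))
  b <- c(sort(b, decreasing = TRUE), rep(0, n - length(b)))
  sum(abs(a - b)) / sum(a + b)
}

text_size_sketch <- function(corpus1, corpus2, max.size, random.sampling = FALSE) {
  # trim max.size so that it stays below the size of the smaller corpus
  max.size <- min(max.size, min(length(corpus1), length(corpus2)) - 1)
  sizes <- 2:max.size
  vapply(sizes, function(k) {
    # take k tokens either at random or from the beginning of each corpus
    idx1 <- if (random.sampling) sample(length(corpus1), k) else seq_len(k)
    idx2 <- if (random.sampling) sample(length(corpus2), k) else seq_len(k)
    f1 <- as.vector(table(corpus1[idx1]))
    f2 <- as.vector(table(corpus2[idx2]))
    nfd_sketch(f1, f2)
  }, numeric(1))
}

## two made-up "corpora" with different type distributions
set.seed(1)
toy1 <- sample(c("a", "b", "c", "d"), 200, replace = TRUE, prob = c(.6, .2, .1, .1))
toy2 <- sample(letters[1:10], 200, replace = TRUE)
scores <- text_size_sketch(toy1, toy2, max.size = 100)
length(scores) # one score per text size in 2:100
```

Each step re-tabulates the chunk from scratch, so the cost grows with both the number of sizes and the chunk size, which is why a sparser sequence of sizes pays off on large corpora.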
- If you are going to run a text size simulation on larger corpora (> 100000 tokens), consider either setting random.sampling = FALSE or providing a sparser sequence in max.size (for an example see below).
- Both classes are S4 objects, the contents of which can be accessed using the @ symbol. For example, to get the value stored in an NFD object you can type n@nfd_value. Both classes provide a getValue() method, which returns the nfd_value and the vector of NFD values, respectively.
- The TSsim class lets you specify the frequency difference function (through fun =). While the default is NFD, you can use any function that, given two vectors of numbers, returns a scalar.
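As an illustration of the kind of function that could be passed through fun =, here is a hedged sketch of an alternative measure: the Euclidean distance between rank-ordered relative frequencies. The function `rel_freq_dist` is hypothetical and is not part of the package. It does its own sorting and zero-padding because this README does not say whether the vectors handed to the function are already aligned.

```r
## Hypothetical custom measure: takes two numeric vectors, returns a scalar.
rel_freq_dist <- function(a, b) {
  n <- max(length(a), length(b))
  # rank-order, pad the shorter vector with zeros, convert to relative frequencies
  pa <- c(sort(a, decreasing = TRUE), rep(0, n - length(a))) / sum(a)
  pb <- c(sort(b, decreasing = TRUE), rep(0, n - length(b))) / sum(b)
  sqrt(sum((pa - pb)^2))
}

## stand-alone check on the example distributions used earlier
d <- rel_freq_dist(c(45, 20, 15, 10, 5, 1, 1, 1, 1, 1), rep(10, 10))

## With the nfd package loaded, the function would be supplied like this
## (left commented out so the block runs without the package):
## text.size.sim <- TSsim(udhr.demo$english, udhr.demo$italian,
##                        max.size = 1000, fun = rel_freq_dist)
```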
Suppose you have two corpora of 10 million tokens each and want to see how the NFD value changes as the text size increases. Running with random.sampling = TRUE over the full sequence of sizes would be too cumbersome, so it might be a good idea to take random text sizes:

    my.max.size <- sort(sample(1e7, 50000, replace = FALSE))

You might also expect that as the size increases the differences are going to be much smaller (there is going to be minimal variation between 5 * 1e6 and 5 * 1e6 + 1). You can, therefore, supply a vector of probabilities in your call to sample so that early values are weighted more heavily:

    ## some sort of weighting function
    max.size <- 1e7
    wf <- function(k, t) 1 - ((k * t) / (k - t + 1))
    rng <- seq(0, 1, length.out = max.size)
    probs <- wf(rng, .9)
    ## you can also try
    ## probs <- wf(.9, rng)
    my.max.size <- sort(sample(max.size, 5000, replace = FALSE, prob = probs))

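If reproducibility matters more than randomness, a deterministic alternative is to space the text sizes logarithmically, which also concentrates points at small sizes. This is a swapped-in technique and not something the package prescribes. The variable `n.points` is a hypothetical name, and passing the resulting vector through max.size simply follows the usage shown above.

```r
## Log-spaced text sizes: dense at small sizes, sparse at large ones.
max.size <- 1e7
n.points <- 200 # hypothetical number of sizes to evaluate
log.sizes <- seq(log(2), log(max.size), length.out = n.points)
# rounding can create duplicates at the small end, so drop them
my.max.size <- unique(round(exp(log.sizes)))

head(my.max.size)
range(my.max.size)
```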
- The actual NFD function is written in C (found in src/), so make sure you have a compiler set up. On Windows you might want to check RTools.
- On macOS you should have the Xcode command-line tools installed. Remember that on some systems you might need to agree to the new license agreement (for more check here; thanks @cainesap).
For more information on the derivation of the measure you can consult Bentz et al. (2015). For more on how to perform different kinds of simulations either check the wiki page or evaluate