# Demo 10 - Benford's Law

This is a demo of Benford's Law, a numerical phenomenon concerning the first digit of the numbers in many real-world data sets: smaller first digits appear more often than larger ones, with 1 leading about 30% of the time.

In [None]:
install.packages('tidyverse', repos = "http://cran.us.r-project.org", dependencies = TRUE)
install.packages('benford.analysis', repos = "http://cran.us.r-project.org")

In [None]:
library(tidyverse)
library(benford.analysis)
library(MASS)

First, we will manually build the first-digit Benford data set.  I am adding 0.5 to each digit to make the points easier to see against the histogram backdrop.

In [None]:
benford.predictions <- data.frame(
  # Digits 1-9, each shifted by 0.5 for plotting
  Digit = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5),
  # Expected first-digit frequencies under Benford's Law
  Frequency = c(0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046)
)
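
The hard-coded frequencies above come from Benford's formula, P(d) = log10(1 + 1/d) for d = 1 through 9. As a minimal sketch, the same values can be computed instead of typed in. The name `benford.computed` is introduced here for illustration and is not part of the original notebook.

```r
# Benford's Law: P(d) = log10(1 + 1/d) for first digit d
digits <- 1:9
benford.computed <- data.frame(
  Digit = digits + 0.5,                        # same 0.5 plotting offset as above
  Frequency = round(log10(1 + 1 / digits), 3)  # rounded to three decimal places
)
benford.computed
```

Computing the values this way also makes it easy to check that the nine unrounded probabilities sum to 1.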

The following function extracts the first digit of a number, as we only care about the first digit for our analysis.

In [None]:
firstdigit <- function(k) {
  # Convert the number to a string, split it into characters, and keep the first one
  as.numeric(head(strsplit(as.character(k), '')[[1]], n = 1))
}
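
The function above handles one value at a time, which is why it is later applied with `sapply`. As a hedged alternative, the sketch below works on a whole vector at once by formatting each value in scientific notation, so that the first character is the leading significant digit. The name `first_digit_vec` is hypothetical, and the sketch assumes nonzero, finite input.

```r
# Hypothetical vectorized alternative to firstdigit(): in scientific notation
# the first character of each formatted value is its leading significant digit.
first_digit_vec <- function(x) {
  as.integer(substr(formatC(abs(x), format = "e", digits = 15), 1, 1))
}

first_digit_vec(c(7, 42, 1999, 250000, 0.0031))
```

Unlike the string-splitting version, this also returns the first significant digit for values below 1, such as 0.0031.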

## North Carolina Population

For our first example, we will load the 2013 populations of North Carolina cities.

In [None]:
nc.pop <- read.csv("Data/NorthCarolinaPopulation2013.csv", sep=",", header=TRUE)

In [None]:
tail(nc.pop, 5)

Now we will take the population data and get just the first digit.  We will build a histogram of the first digit of each city's population, and then overlay the frequencies that Benford's Law predicts.

In [None]:
nc.pop.first <- sapply(nc.pop$Population, firstdigit)
truehist(nc.pop.first, nbins=10, ymax = 0.35)
points(benford.predictions)
lines(benford.predictions)
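
The same comparison can be made numerically instead of visually. The sketch below is self-contained and uses illustrative data (the first 100 powers of two, a sequence known to be Benford-like) in place of the population file. The variable names are assumptions made for this example.

```r
# Illustrative data only: the first 100 powers of two
x <- 2^(1:100)

# First digit of each value, read from its scientific-notation formatting
lead <- as.integer(substr(formatC(x, format = "e", digits = 15), 1, 1))

# Observed share of each first digit, keeping digits that never occur
observed <- as.numeric(table(factor(lead, levels = 1:9))) / length(lead)
expected <- log10(1 + 1 / (1:9))

comparison <- data.frame(Digit = 1:9,
                         Observed = round(observed, 3),
                         Expected = round(expected, 3))
comparison
```

Swapping `x` for `nc.pop$Population` should give the table behind the histogram above.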

The fit is not perfect, but the observed first digits clearly follow the Benford curve.

Now let's look at a fancier method of plotting Benford analysis, using the benford.analysis library that we loaded at the top of the notebook.

In [None]:
cp <- benford(data = nc.pop$Population, number.of.digits = 1, sign = "positive", discrete=TRUE, round=3)
plot(cp)

According to the benford.analysis tutorial (https://github.com/carloscinelli/benford.analysis), data that follows Benford's Law should have mantissa statistics close to the following:

|Statistic|Expected value|
|---------|--------------|
|mean|0.5|
|variance|0.0833 (1/12)|
|excess kurtosis|-1.2|
|skewness|0|

In [None]:
cp$mantissa

Our actual results are very close to the expected results.
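
The reference values in the table above are the moments of a uniform distribution on [0, 1), which is how the mantissa of Benford-conforming data should be distributed. The sketch below checks this by hand on simulated data. It takes the mantissa to be the fractional part of log10(x), and the simulated sample and all variable names are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated Benford-conforming data: 10 raised to a uniform exponent
x <- 10^runif(100000, min = 0, max = 6)

# Mantissa, taken here as the fractional part of log10(x)
m <- log10(x) %% 1

m.mean <- mean(m)
m.var  <- var(m)
z      <- (m - m.mean) / sqrt(mean((m - m.mean)^2))  # standardized values
m.skew <- mean(z^3)       # skewness
m.kurt <- mean(z^4) - 3   # excess kurtosis

round(c(mean = m.mean, variance = m.var,
        ex.kurtosis = m.kurt, skewness = m.skew), 3)
```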

### Two-Digit Sampling

Now let's try two-digit sampling.  Instead of looking at just the first digit of each number, let's look at the first two digits.  Benford's Law has a set of predictions for each two-digit pairing.

In [None]:
cp2 <- benford(data = nc.pop$Population, number.of.digits = 2, sign = "positive", discrete=TRUE, round=3)
plot(cp2)

Note:  the spikes in the second-order test do *not* indicate a problem; they indicate that the data is discrete, not continuous.  This is a common occurrence with discrete values packed into tight ranges.
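
For reference, the two-digit predictions use the same formula as the single-digit case, P(d) = log10(1 + 1/d), with d now running from 10 to 99. A minimal sketch follows; the name `benford.two` is introduced here for illustration.

```r
# Benford probabilities for the first two digits, d = 10..99
two.digits <- 10:99
benford.two <- data.frame(
  Digits = two.digits,
  Frequency = log10(1 + 1 / two.digits)
)

head(benford.two, 3)        # the most common leading pairs are 10, 11, and 12
sum(benford.two$Frequency)  # the 90 probabilities sum to 1, up to rounding error
```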

## Homeowner's Association Budget

Our second example looks at my HOA's budget over a three-year period.

In [None]:
hoa <- read.table('Data/HOABudget.txt', header = TRUE, sep = '\t', quote = "")

In [None]:
hoa.yearly <- hoa %>%
               # Use gather to unpivot our actuals & estimates by fiscal year into a single column
               gather(FiscalYear, Amount, X2013:X2015, na.rm = TRUE) %>%
               # Drop the leading "X" that R adds to numeric column names
               mutate(FiscalYear = substring(FiscalYear, 2, 5))

Now that we have the data loaded for 2013-2015, let's analyze this.  First, we will analyze a single year at a time.  Then, we will analyze the entire data set.

In [None]:
hoa.2013 <- hoa.yearly %>% filter(FiscalYear == 2013)
cp <- benford(data = hoa.2013$Amount, number.of.digits = 1, sign = "positive", discrete=TRUE, round=3)
plot(cp, except=c("second order", "summation", "mantissa", "chi squared", "abs diff", "ex summation"))

In [None]:
hoa.2014 <- hoa.yearly %>% filter(FiscalYear == 2014)
cp <- benford(data = hoa.2014$Amount, number.of.digits = 1, sign = "positive", discrete=TRUE, round=3)
plot(cp, except=c("second order", "summation", "mantissa", "chi squared", "abs diff", "ex summation"))

In [None]:
hoa.2015 <- hoa.yearly %>% filter(FiscalYear == 2015)
cp <- benford(data = hoa.2015$Amount, number.of.digits = 1, sign = "positive", discrete=TRUE, round=3)
plot(cp, except=c("second order", "summation", "mantissa", "chi squared", "abs diff", "ex summation"))

In [None]:
cp <- benford(data = hoa.yearly$Amount, number.of.digits = 1, sign = "positive", discrete=TRUE, round=3)
plot(cp, except=c("second order", "summation", "mantissa", "chi squared", "abs diff", "ex summation"))

In [None]:
cp$mantissa

The mean is a bit lower than "perfect" and the kurtosis and skewness are both a little high.  We can see a Benford-like trend, but this doesn't quite fit.

But before we start accusing my HOA of siphoning funds off, let's look at one last important measure:

In [None]:
hoa.yearly %>% filter(Amount > 0) %>% count()

That is, there are only 260 relevant entries in the entire sample.  This is not a large sample, so we should expect some deviation from the Benford predictions.
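
To put a rough number on that deviation, the sketch below computes the binomial standard error of each first-digit share for a sample of 260 values. It assumes the entries are independent draws from the Benford distribution, which is a simplification for budget data.

```r
n <- 260                   # number of positive amounts reported above
p <- log10(1 + 1 / (1:9))  # Benford probability for each first digit

# Binomial standard error of each observed digit share at this sample size
se <- sqrt(p * (1 - p) / n)

round(data.frame(Digit = 1:9, Expected = p, StdError = se), 3)
```

Under these assumptions, the share of leading 1s has a standard error of roughly 0.03, so an observed share a few percentage points away from 0.301 is not surprising.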

## Last Digit Analysis

Now that we see Benford's Law holding for North Carolina's population and even somewhat for my local HOA, does the same phenomenon hold for the *last* digit of each number?

To figure this out, we first need to create a function to get the last digit of each number.

In [None]:
lastdigit <- function(k) {
  as.numeric(tail(strsplit(as.character(k), '')[[1]],n=1))
}
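
For whole-number input, the last digit can also be taken arithmetically. The sketch below is a hypothetical alternative, not the notebook's method; the name `last_digit_int` is an assumption, and it only makes sense for values without a fractional part.

```r
# Hypothetical alternative for whole-number input: the remainder after
# dividing by 10. This avoids string conversion, which may switch to
# scientific notation for some large doubles (100000 may print as "1e+05").
last_digit_int <- function(x) {
  abs(x) %% 10
}

last_digit_int(c(7, 42, 1999, 250000, 100000))
```

This version is vectorized, so it can be applied to a whole column without `sapply`.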

Next up, we will apply the function to each record in nc.pop, build a histogram, and overlay the Benford predictions.

In [None]:
nc.pop.last <- sapply(nc.pop$Population, lastdigit)
truehist(nc.pop.last, nbins=10, ymax = 0.35)
points(benford.predictions)
lines(benford.predictions)

Benford's Law emphatically does not fit here.  Furthermore, last digits are *not* expected to follow Benford's Law.  Instead, we should assume that the last digit is uniform unless there is a reason to believe otherwise.
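
One way to check the uniform assumption is a chi-squared goodness-of-fit test from base R. The sketch below runs on simulated last digits, not the North Carolina data, and all variable names are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated last digits (illustrative only)
last.digits <- sample(0:9, size = 1000, replace = TRUE)

# Under the uniform assumption, each of the ten digits has probability 0.1
expected <- rep(0.1, 10)
counts <- as.vector(table(factor(last.digits, levels = 0:9)))

# Chi-squared goodness-of-fit test of the observed counts against uniform
result <- chisq.test(x = counts, p = expected)
result$p.value
```

A small p-value would suggest that the last digits are not uniform; replacing `last.digits` with `nc.pop.last` applies the same check to the population data.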

Similarly, here is the last digit for my HOA budget values:

In [None]:
hoa.last <- sapply(hoa.yearly$Amount, lastdigit)
truehist(hoa.last, nbins=10)
points(benford.predictions)
lines(benford.predictions)

My HOA's last digit is almost always a 0, and it's not even close.  But there's a reason for us to expect this:  budgeted values tend to end in 0, as there is little value in false precision.  This is a case where we should not expect the uniform distribution to hold.