
Write vignette #4

Open
carloscinelli opened this issue Oct 6, 2015 · 4 comments
@carloscinelli (Owner)

Write package vignette.

@carloscinelli carloscinelli added this to the v0.1.3 milestone Oct 6, 2015
@carloscinelli carloscinelli modified the milestones: v0.1.5, v0.1.3 Nov 17, 2015
@Henry-E commented Mar 18, 2016

It would be really helpful if you could outline how to interpret the values returned from the print and plot functions.
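For concreteness, the kind of output I mean comes from a minimal run like the one below (just a sketch: corporate.payment is the example data set that ships with the package, and the settings are only there for illustration):

```r
library(benford.analysis)

# example data shipped with the package
data(corporate.payment)

# run the analysis on the payment amounts, looking at the first two digits
bfd <- benford(corporate.payment$Amount, number.of.digits = 2, sign = "positive")

# print() reports the chi-squared test, the mantissa arc test, the mean
# absolute deviation (MAD), the distortion factor, and the digits with the
# largest deviations from the expected Benford frequencies
print(bfd)

# plot() draws the observed digit frequencies against the expected Benford
# frequencies, along with some additional diagnostic panels
plot(bfd)
```

It's these statistics (MAD, chi-squared, mantissa arc test, distortion factor) that I'm not sure how to read.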

@Henry-E commented Jul 25, 2016

Hey, I put together an R Markdown document comparing a couple of different distributions that typically follow Benford's law (lognormal with high standard deviation, exponential) and some that don't (lognormal with small standard deviation, normal, uniform).

In terms of using the output of the print and plot functions to determine which groups of numbers follow Benford's law, I'm having trouble understanding how to interpret them successfully. Aside from the classic digits distribution chart, it's hard to see which indicators could be used to identify rogue sets of numbers. I know you say not to focus on the p-values, but many of the distributions I've looked at in practice that appear to follow Benford's law still have tiny p-values (< 1e-5 in many cases).

Do you have any recommendations or tips for how you use the package? Like you, I'm trying to apply this in a central bank.

---
title: "Benford's law, digit analysis"
output: 
  html_document:
    toc: true
    toc_float: true
# params:
#   numberOfDigits: 1
#   columnToGroupBy: NA
#   data: NA
---

```{r setup, echo=FALSE,  warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(data.table)  # the bfd element of a benford object is a data.table
library(dplyr)
library(stringr)
library(benford.analysis)
library(pander)
options(stringsAsFactors = FALSE)

numObs <- 1e4
data <- tbl_df(bind_rows(data.frame(value = runif(numObs), distribution = "uniform"), 
                         data.frame(value = abs(rnorm(numObs)), distribution = "normal"),
                         data.frame(value = rexp(numObs, rate = 1), distribution = "exponential"),
                         data.frame(value = rlnorm(numObs, meanlog = 1, sdlog = 0.6), distribution = "lognormal - narrow"),
                         data.frame(value = rlnorm(numObs, meanlog = 10, sdlog = 100), distribution = "lognormal - wide")))
columnToGroupBy <- c("distribution")

params <- list()
params$numberOfDigits <- 1
params$columnToGroupBy <- columnToGroupBy
params$data <- data

# we tried to use NULL here before but that leads to an issue
# https://github.com/rstudio/rmarkdown/issues/729
if(identical(params$columnToGroupBy, "NA") || identical(params$data, "NA"))
  stop("missing column name(s) or data")

if(ncol(params$data) < 2)
  stop("not enough columns in the data")

areColumnsPresent <- c(params$columnToGroupBy, "value") %in% colnames(params$data)
if(!all(areColumnsPresent))
  stop(str_c(c("columns requested are not present:", params$columnToGroupBy), collapse = "\n"))

dataTypeOfValueColumn <- sapply(params$data, class)["value"]
if(dataTypeOfValueColumn != "integer" && dataTypeOfValueColumn != "numeric")
  stop(str_c("Data type of the value column should be integer or numeric, instead it is: ", dataTypeOfValueColumn))

if(params$numberOfDigits > 2)
  stop(str_c("checking that number of digits is probably not a great idea ", params$numberOfDigits))
​​​​```

```{r benfordAnalysis, results = "asis"}
# the do() function applies the benford function to each group
# and creates a list of S3 objects
# https://stat545-ubc.github.io/block023_dplyr-do.html#meet-do
allTheBenfords <- params$data %>%
  group_by_(.dots = params$columnToGroupBy) %>%
  do(benford =  benford(.$value, number.of.digits = params$numberOfDigits))

# order them by the Mean Absolute Deviation; it's not a totally accurate way to
# rank them, but it's better than nothing. I couldn't figure out how to do it
# as part of the dplyr chain, the S3 object subsetting got the better of me. I
# was able to use do() and the .(dot) operator to get a column of the values,
# but it deleted all the other columns for some reason and I couldn't use the
# same syntax with mutate
allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford,
                                              function(x) x[["MAD"]]), decreasing = TRUE), ]
# We're testing out ranking by the mantissa arc 
# allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford, 
#                                               function(x) x$stats$mantissa.arc.test$p.value), decreasing = TRUE), ]

for(observation in seq(nrow(allTheBenfords))){

  thisBenfordAnalysis <- allTheBenfords$benford[[observation]]
  cat("\n\n## ", allTheBenfords[observation, params$columnToGroupBy][[1]], "\n\n")

  plot(thisBenfordAnalysis)
  cat("\n")

  # cribbed from the print function for benford.analysis class, which does not
  # print well in rmarkdown
  # https://github.com/carloscinelli/benford.analysis/blob/master/R/functions-new.R
  overallDescription <- data.frame(description =
                                  c("Number of observations used", 
                                     "Number of obs. for second order", 
                                     "First digits analysed"),
                                  value =
                                   c(thisBenfordAnalysis[["info"]]$n, 
                                     thisBenfordAnalysis[["info"]]$n.second.order,
                                     thisBenfordAnalysis[["info"]]$number.of.digits))
  cat("\n#### Overall description of analysis\n")
  cat(pandoc.table(overallDescription))

  cat("\n\n#### Mantissa arc test \n")
  mantissaTable <- thisBenfordAnalysis$mantissa
  mantissaTable$statistic <-  gsub("Mantissa|\\s", " ", mantissaTable$statistic)
  mantissaTable$`expected value` <- c(0.5, 0.08333, -1.2, 0)
  cat(pandoc.table(mantissaTable))
  cat("\n\n#### Numerical deviations")
  how.many <- 5
  cat("\nThe", how.many, "largest deviations: \n\n")
  cat(pandoc.table(round(head(thisBenfordAnalysis[["bfd"]][order(absolute.diff, decreasing=TRUE)][,list(digits, absolute.diff)], how.many), 2)), sep="\n")
  deviationStats <- data.frame(description =
                                 c("Mean Absolute Deviation",
                                   "Distortion Factor"),
                               value =
                                 c(thisBenfordAnalysis[["MAD"]],
                                   thisBenfordAnalysis[["distortion.factor"]]))
  cat(pandoc.table(deviationStats))

  cat("\n\n#### Statistics\n")
  cat("\n\n##### Pearson's Chi-squared test\n")
  chisqStats <- data.frame(description =
                            c("X-squared",
                              "df",
                              "p-value <"),
                           value =
                             c(thisBenfordAnalysis[["stats"]]$chisq$statistic,
                               thisBenfordAnalysis[["stats"]]$chisq$parameter,
                               thisBenfordAnalysis[["stats"]]$chisq$p.value))
  cat(pandoc.table(chisqStats))
  cat("\n\n##### Mantissa Arc Test\n")
  mantissaStats <- data.frame(description = 
                                c("L2",
                                  "df",
                                  "p-value <"),
                              value = 
                                c(thisBenfordAnalysis[["stats"]]$mantissa.arc.test$statistic,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$parameter,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$p.value))
  cat(pandoc.table(mantissaStats))
  cat("\n\nRemember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!\n\n")
}
```

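One more note on the p-value point above: as far as I can tell, the chi-squared statistic grows with the sample size for any fixed deviation from Benford's law, so with n = 1e4 even a distribution that looks fine on the digits chart gets a vanishingly small p-value, while the MAD barely moves. A rough sketch using the exponential case from the setup chunk (the field names are the same ones used in the document above):

```r
library(benford.analysis)
set.seed(123)

# the exponential case from the setup chunk: it looks close to Benford's law
# on the digits chart, but the first-digit frequencies are a few percent off
# the exact Benford expectation
for (n in c(1e3, 1e4, 1e5)) {
  x <- rexp(n, rate = 1)
  bfd <- benford(x, number.of.digits = 1)
  cat(sprintf("n = %6.0f   chi-sq = %8.1f   p-value = %.2e   MAD = %.4f\n",
              n,
              bfd$stats$chisq$statistic,
              bfd$stats$chisq$p.value,
              bfd$MAD))
}
# the chi-squared statistic scales roughly with n, so the p-value collapses as
# the sample grows, while the MAD (an average per-digit deviation) settles at
# a small, stable value
```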
@carloscinelli (Owner, Author)

Thanks for the comment, Henry! I will take a look at your example and try to explain it better. But just to give a quick answer: the focus here is to have one more indicator (which you would combine with other indicators) to find suspicious data that needs further investigation. You shouldn't use it to get a clear-cut "yes/no" answer about the quality of a data set or even a single data point.
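A rough sketch of what that workflow can look like with the helpers in the package (using the example data that ships with it; the settings are only for illustration):

```r
library(benford.analysis)

data(corporate.payment)  # example data shipped with the package
bfd <- benford(corporate.payment$Amount, number.of.digits = 2)

# digit groups ranked by how far their observed frequency is from the
# Benford expectation
suspectsTable(bfd)

# pull the original records whose leading digits fall in the worst-fitting
# groups, so they can be cross-checked against other indicators
suspects <- getSuspects(bfd, corporate.payment, how.many = 2)
head(suspects)
```

The output is a starting point for manual review, not a verdict on the data.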

@vonjd commented Dec 23, 2019

When will the vignette be ready?
