
Write vignette #4

Open
carloscinelli opened this issue Oct 6, 2015 · 4 comments
@carloscinelli (Owner)

Write package vignette.

@carloscinelli carloscinelli added this to the v0.1.3 milestone Oct 6, 2015
@carloscinelli carloscinelli modified the milestones: v0.1.5, v0.1.3 Nov 17, 2015
@Henry-E commented Mar 18, 2016

It would be really helpful if you could outline how to interpret the values returned from the print and plot functions.
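For concreteness, the kind of output I mean comes from a minimal run like the one below (just a sketch: corporate.payment is the example data set that ships with the package, and the settings are only there for illustration):

```r
library(benford.analysis)

# example data shipped with the package
data(corporate.payment)

# run the analysis on the payment amounts, looking at the first two digits
bfd <- benford(corporate.payment$Amount, number.of.digits = 2, sign = "positive")

# print() reports the chi-squared test, the mantissa arc test, the mean
# absolute deviation (MAD), the distortion factor, and the digits with the
# largest deviations from the expected Benford frequencies
print(bfd)

# plot() draws the observed digit frequencies against the expected Benford
# frequencies, along with some additional diagnostic panels
plot(bfd)
```

It's these statistics (MAD, chi-squared, mantissa arc test, distortion factor) that I'm not sure how to read.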

@Henry-E commented Jul 25, 2016

Hey, I put together an R Markdown document comparing a couple of different distributions that typically follow Benford's law (lognormal with high standard deviation, exponential) and some that don't (lognormal with small standard deviation, normal, uniform).

In terms of using the output of the print and plot functions to determine which groups of numbers follow Benford's law, I'm having trouble understanding how to interpret them successfully. Aside from the classic digits distribution chart, it's hard to see which indicators could be used to identify rogue sets of numbers. I know you say not to focus on the p-values, but many of the distributions I've looked at in practice that appear to follow Benford's law still have tiny p-values (< 1e-5 in many cases).

Do you have any recommendations or tips for how you use the package? Like you, I'm trying to apply this in a central bank.

---
title: "Benford's law, digit analysis"
output: 
  html_document:
    toc: true
    toc_float: true
# params:
#   numberOfDigits: 1
#   columnToGroupBy: NA
#   data: NA
---

```{r setup, echo=FALSE,  warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(data.table)  # the bfd element of a benford object is a data.table
library(dplyr)
library(stringr)
library(benford.analysis)
library(pander)
options(stringsAsFactors = FALSE)

numObs <- 1e4
data <- tbl_df(bind_rows(data.frame(value = runif(numObs), distribution = "uniform"), 
                         data.frame(value = abs(rnorm(numObs)), distribution = "normal"),
                         data.frame(value = rexp(numObs, rate = 1), distribution = "exponential"),
                         data.frame(value = rlnorm(numObs, meanlog = 1, sdlog = 0.6), distribution = "lognormal - narrow"),
                         data.frame(value = rlnorm(numObs, meanlog = 10, sdlog = 100), distribution = "lognormal - wide")))
columnToGroupBy <- c("distribution")

params <- list()
params$numberOfDigits <- 1
params$columnToGroupBy <- columnToGroupBy
params$data <- data

# we tried to use NULL here before but that leads to an issue
# https://github.com/rstudio/rmarkdown/issues/729
if(identical(params$columnToGroupBy, "NA") || identical(params$data, "NA"))
  stop("missing column name(s) or data")

if(ncol(params$data) < 2)
  stop("not enough columns in the data")

areColumnsPresent <- c(params$columnToGroupBy, "value") %in% colnames(params$data)
if(!all(areColumnsPresent))
  stop(str_c(c("columns requested are not present:", params$columnToGroupBy), collapse = "\n"))

dataTypeOfValueColumn <- sapply(params$data, class)["value"]
if(dataTypeOfValueColumn != "integer" && dataTypeOfValueColumn != "numeric")
  stop(str_c("Data type of the value column should be integer or numeric, instead it is: ", dataTypeOfValueColumn))

if(params$numberOfDigits > 2)
  stop(str_c("checking that number of digits is probably not a great idea ", params$numberOfDigits))
​​​​```

```{r benfordAnalysis, results = "asis"}
# the do() function applies the benford function to each group
# and creates a list of S3 objects
# https://stat545-ubc.github.io/block023_dplyr-do.html#meet-do
allTheBenfords <- params$data %>%
  group_by_(.dots = params$columnToGroupBy) %>%
  do(benford =  benford(.$value, number.of.digits = params$numberOfDigits))

# order them by the Mean Absolute Deviation; it's not a totally accurate way to
# rank them, but it's better than nothing. I couldn't figure out how to do it
# as part of the dplyr chain, the S3 object subsetting got the better of me. I
# was able to use do() and the .(dot) operator to get a column of the values,
# but it deleted all the other columns for some reason and I couldn't use the
# same syntax with mutate
allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford,
                                              function(x) x[["MAD"]]), decreasing = TRUE), ]
# We're testing out ranking by the mantissa arc 
# allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford, 
#                                               function(x) x$stats$mantissa.arc.test$p.value), decreasing = TRUE), ]

for(observation in seq(nrow(allTheBenfords))){

  thisBenfordAnalysis <- allTheBenfords$benford[[observation]]
  cat("\n\n## ", allTheBenfords[observation, params$columnToGroupBy][[1]], "\n\n")

  plot(thisBenfordAnalysis)
  cat("\n")

  # cribbed from the print function for benford.analysis class, which does not
  # print well in rmarkdown
  # https://github.com/carloscinelli/benford.analysis/blob/master/R/functions-new.R
  overallDescription <- data.frame(description =
                                  c("Number of observations used", 
                                     "Number of obs. for second order", 
                                     "First digits analysed"),
                                  value =
                                   c(thisBenfordAnalysis[["info"]]$n, 
                                     thisBenfordAnalysis[["info"]]$n.second.order,
                                     thisBenfordAnalysis[["info"]]$number.of.digits))
  cat("\n#### Overall description of analysis\n")
  cat(pandoc.table(overallDescription))

  cat("\n\n#### Mantissa arc test \n")
  mantissaTable <- thisBenfordAnalysis$mantissa
  mantissaTable$statistic <-  gsub("Mantissa|\\s", " ", mantissaTable$statistic)
  mantissaTable$`expected value` <- c(0.5, 0.08333, -1.2, 0)
  cat(pandoc.table(mantissaTable))
  cat("\n\n#### Numerical deviations")
  how.many <- 5
  cat("\nThe", how.many, "largest deviations: \n\n")
  cat(pandoc.table(round(head(thisBenfordAnalysis[["bfd"]][order(absolute.diff, decreasing=TRUE)][,list(digits, absolute.diff)], how.many), 2)), sep="\n")
  deviationStats <- data.frame(description =
                                 c("Mean Absolute Deviation",
                                   "Distortion Factor"),
                               value =
                                 c(thisBenfordAnalysis[["MAD"]],
                                   thisBenfordAnalysis[["distortion.factor"]]))
  cat(pandoc.table(deviationStats))

  cat("\n\n#### Statistics\n")
  cat("\n\n##### Pearson's Chi-squared test\n")
  chisqStats <- data.frame(description =
                            c("X-squared",
                              "df",
                              "p-value <"),
                           value =
                             c(thisBenfordAnalysis[["stats"]]$chisq$statistic,
                               thisBenfordAnalysis[["stats"]]$chisq$parameter,
                               thisBenfordAnalysis[["stats"]]$chisq$p.value))
  cat(pandoc.table(chisqStats))
  cat("\n\n##### Mantissa Arc Test\n")
  mantissaStats <- data.frame(description = 
                                c("L2",
                                  "df",
                                  "p-value <"),
                              value = 
                                c(thisBenfordAnalysis[["stats"]]$mantissa.arc.test$statistic,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$parameter,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$p.value))
  cat(pandoc.table(mantissaStats))
  cat("\n\nRemember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!\n\n")
}
```

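One more note on the p-value point above: as far as I can tell, the chi-squared statistic grows with the sample size for any fixed deviation from Benford's law, so with n = 1e4 even a distribution that looks fine on the digits chart gets a vanishingly small p-value, while the MAD barely moves. A rough sketch using the exponential case from the setup chunk (the field names are the same ones used in the document above):

```r
library(benford.analysis)
set.seed(123)

# the exponential case from the setup chunk: it looks close to Benford's law
# on the digits chart, but the first-digit frequencies are a few percent off
# the exact Benford expectation
for (n in c(1e3, 1e4, 1e5)) {
  x <- rexp(n, rate = 1)
  bfd <- benford(x, number.of.digits = 1)
  cat(sprintf("n = %6.0f   chi-sq = %8.1f   p-value = %.2e   MAD = %.4f\n",
              n,
              bfd$stats$chisq$statistic,
              bfd$stats$chisq$p.value,
              bfd$MAD))
}
# the chi-squared statistic scales roughly with n, so the p-value collapses as
# the sample grows, while the MAD (an average per-digit deviation) settles at
# a small, stable value
```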
@carloscinelli (Owner, Author)

Thanks for the comment, Henry! I will take a look at your example and try to explain it better. But just to give a quick answer: the focus here is to have one more indicator (which you would combine with other indicators) to find suspicious data that needs further investigation. You shouldn't use it to get a clear-cut "yes/no" answer about the quality of a data set or even a single data point.
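A rough sketch of what that workflow can look like with the helpers in the package (using the example data that ships with it; the settings are only for illustration):

```r
library(benford.analysis)

data(corporate.payment)  # example data shipped with the package
bfd <- benford(corporate.payment$Amount, number.of.digits = 2)

# digit groups ranked by how far their observed frequency is from the
# Benford expectation
suspectsTable(bfd)

# pull the original records whose leading digits fall in the worst-fitting
# groups, so they can be cross-checked against other indicators
suspects <- getSuspects(bfd, corporate.payment, how.many = 2)
head(suspects)
```

The output is a starting point for manual review, not a verdict on the data.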

@vonjd commented Dec 23, 2019

When will the vignette be ready?
