<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Benford's Law

_Authors: Riley Dallas (AUS)_

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.max_columns', 999)

## Learning Objectives
---

- Understand what Benford's law is.
- Use Benford's law to find anomalies within a dataset.


## About Benford's Law
---

> Benford's law is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. For example, in sets that obey the law, the number 1 appears as the most significant digit about 30% of the time, while 9 appears as the most significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on. - From [Wikipedia](https://en.wikipedia.org/wiki/Benford%27s_law)

Benford's law has many applications in exploratory data analysis. For example, forensic accountants often use it to screen financial reports for irregularities. Financial datasets tend to follow Benford's law **unless human intervention is involved**, so a deviation from the expected distribution can sometimes be a sign of fraud.

For today's lesson, we'll apply Benford's law to the [City of Austin's Online Checkbook](https://data.austintexas.gov/Budget-and-Finance/Austin-Finance-Online-eCheckbook/8c6z-qnmj) to explore anomalies in the data.

## Challenge: Bar chart
---

For digits 1-9, create a bar chart that shows the expected percentage of first digits in the dataset according to Benford's law:

$P(d)=\log_{10}(d+1)-\log_{10}(d)=\log_{10} \left(\frac{d+1}{d}\right)=\log_{10} \left(1+\frac{1}{d}\right)$

For example, we expect to see numbers with a 1 as the first digit roughly 30% of the time:

$P(1)=\log_{10}(1+1)-\log_{10}(1) \approx 0.301$
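One way to approach this challenge is sketched below. It evaluates the formula above for digits 1 through 9 and draws the result as a bar chart. The variable names (`digits`, `benford`) are illustrative choices, not part of the lesson.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Leading digits 1 through 9
digits = np.arange(1, 10)

# Benford's law: P(d) = log10(1 + 1/d)
benford = pd.Series(np.log10(1 + 1 / digits), index=digits, name="expected")

# Bar chart of the expected percentages
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(benford.index, benford.values * 100)
ax.set_xticks(digits)
ax.set_xlabel("First digit")
ax.set_ylabel("Expected share of values (%)")
ax.set_title("Expected first-digit distribution (Benford's law)")
```

The nine probabilities sum to 1, which is a quick sanity check on the calculation.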

## Load the dataset
---

We've already downloaded a csv from the City of Austin's website (`datasets/Austin_Finance_Online_eCheckbook.csv`). Load that into a `pandas` DataFrame.
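A minimal sketch of the loading step follows. The helper name `load_checkbook` is hypothetical, and the sketch assumes the default `pd.read_csv` settings are enough for this file; options such as `dtype` or `parse_dates` may be needed depending on how the columns are stored.

```python
from pathlib import Path

import pandas as pd

# Path given in the lesson text
CSV_PATH = Path("datasets/Austin_Finance_Online_eCheckbook.csv")


def load_checkbook(source=CSV_PATH):
    """Read the checkbook CSV (a path or file-like object) into a DataFrame."""
    return pd.read_csv(source)
```

In the notebook, `df = load_checkbook()` followed by `df.head()` would load the file and preview the first rows.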

## Data cleaning
---

Remove all rows where the `AMOUNT` is less than 10 dollars. This will clean up the data by removing small amounts (including refunds), without sacrificing a lot of data.
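The filter can be written as a boolean mask. The sketch below runs on a few made-up rows standing in for the real checkbook, and it assumes `AMOUNT` is already numeric; if the column were stored as text, it would need converting first.

```python
import pandas as pd

# Made-up stand-in rows; in the notebook, df is the loaded checkbook
df = pd.DataFrame({"AMOUNT": [-250.00, 0.75, 9.99, 10.00, 42.50, 1830.12]})

rows_before = len(df)

# Keep only rows where AMOUNT is at least 10 dollars
df = df[df["AMOUNT"] >= 10].copy()

print(f"Kept {len(df)} of {rows_before} rows")
```

Comparing the row counts before and after shows how much data the filter removes.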

## Feature Engineering: Add a `FIRST_DIGIT` category
---

Create a column called `FIRST_DIGIT` corresponding to the first digit from the `AMOUNT` column.
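One possible approach is to convert each amount to a string and take its first character. This sketch assumes the rows below 10 dollars have already been removed, so every amount is positive and starts with a digit from 1 to 9. The sample values are made up for illustration.

```python
import pandas as pd

# Made-up stand-in rows; all amounts are >= 10 after the cleaning step
df = pd.DataFrame({"AMOUNT": [10.00, 42.50, 1830.12, 999.99, 75000.00]})

# First character of the string representation, converted back to an integer
df["FIRST_DIGIT"] = df["AMOUNT"].astype(str).str[0].astype(int)

print(df)
```

The string approach avoids floating-point edge cases that can arise when extracting the leading digit with logarithms.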

## Plot Actual Percentages vs Expected (Benford) Percentages
---

Next, we'll plot the actual first-digit percentages as a line chart on top of a bar chart of the expected Benford percentages.
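The sketch below shows one way to build this chart. Because the real dataset is not included here, it generates synthetic log-uniform amounts as a stand-in; in the notebook, `df` would be the cleaned checkbook with its `FIRST_DIGIT` column. The names `expected` and `actual` are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

digits = np.arange(1, 10)
expected = pd.Series(np.log10(1 + 1 / digits), index=digits)

# Synthetic stand-in data: log-uniform amounts between 10 and 100,000
rng = np.random.default_rng(0)
df = pd.DataFrame({"AMOUNT": 10 ** rng.uniform(1, 5, size=5000)})
df["FIRST_DIGIT"] = df["AMOUNT"].astype(str).str[0].astype(int)

# Share of rows per first digit, in digit order (0 for any digit never seen)
actual = (
    df["FIRST_DIGIT"]
    .value_counts(normalize=True)
    .reindex(digits, fill_value=0)
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(digits, expected.values * 100, alpha=0.5, label="Expected (Benford)")
ax.plot(digits, actual.values * 100, color="black", marker="o", label="Actual")
ax.set_xticks(digits)
ax.set_xlabel("First digit")
ax.set_ylabel("Share of values (%)")
ax.legend()
```

Reindexing on `digits` keeps the line aligned with the bars even if some digit never appears in the data.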

## Evaluating Benford's Law by Department
---

Let's repeat the chart above, only this time we'll create a line/bar chart for each department.

To get started, create a list of department names that have written at least 1,000 checks.
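A sketch of this step using `value_counts` follows. It makes two assumptions: each row represents one check, and the department name lives in a column called `DEPT_NM`. The column name is a guess, so check `df.columns` for the actual name. The department labels in the stand-in data are made up.

```python
import pandas as pd

DEPT_COL = "DEPT_NM"  # assumed column name; verify against df.columns
MIN_CHECKS = 1000

# Made-up stand-in data: one row per check
df = pd.DataFrame(
    {DEPT_COL: ["Dept A"] * 1500 + ["Dept B"] * 1000 + ["Dept C"] * 999}
)

# Number of rows (checks) per department
check_counts = df[DEPT_COL].value_counts()

# Departments that meet the threshold
departments = check_counts[check_counts >= MIN_CHECKS].index.tolist()

print(departments)
```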

Now we'll plot actuals vs. expected for each of those departments.
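One way to draw a chart per department is to loop over the list and give each department its own subplot. The sketch below uses two made-up departments: one with log-uniform amounts and one whose amounts all fall between 500 and 600, which mimics a fixed fee schedule. The `DEPT_NM` column name is an assumption, and `actual_by_dept` is an illustrative helper dictionary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DEPT_COL = "DEPT_NM"  # assumed column name; verify against df.columns
digits = np.arange(1, 10)
expected = pd.Series(np.log10(1 + 1 / digits), index=digits)

# Synthetic stand-in data for two made-up departments
rng = np.random.default_rng(1)
df = pd.DataFrame({
    DEPT_COL: ["Dept A"] * 2000 + ["Dept B"] * 2000,
    "AMOUNT": np.concatenate([
        10 ** rng.uniform(1, 5, 2000),   # spans several orders of magnitude
        rng.uniform(500, 600, 2000),     # narrow range, always starts with 5
    ]),
})
df["FIRST_DIGIT"] = df["AMOUNT"].astype(str).str[0].astype(int)
departments = ["Dept A", "Dept B"]

fig, axes = plt.subplots(
    len(departments), 1, figsize=(8, 4 * len(departments)), squeeze=False
)
actual_by_dept = {}
for ax, dept in zip(axes.ravel(), departments):
    # First-digit shares for this department only
    actual = (
        df.loc[df[DEPT_COL] == dept, "FIRST_DIGIT"]
        .value_counts(normalize=True)
        .reindex(digits, fill_value=0)
    )
    actual_by_dept[dept] = actual
    ax.bar(digits, expected.values * 100, alpha=0.5, label="Expected (Benford)")
    ax.plot(digits, actual.values * 100, color="black", marker="o", label="Actual")
    ax.set_title(dept)
    ax.set_xticks(digits)
    ax.set_ylabel("Share of values (%)")
    ax.legend()
fig.tight_layout()
```

In this synthetic example, the second department deviates sharply because its amounts span a narrow range, not because anything improper happened. A deviation is a prompt to investigate, not a conclusion.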

## Deep dive
---

We can see a few anomalies with Police, EMS, Austin Resource Recovery and Parks & Rec. Let's dig deeper and investigate the root causes. For this exercise, pick a department that is deviating from Benford's law and find out why.
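A possible starting point for the deep dive is sketched here; it is one approach among many, not a prescribed method. It finds the first digit that is most over-represented for a department, then lists the most frequent amounts starting with that digit. The data, the department label, and the `DEPT_NM` column name are all made up or assumed.

```python
import numpy as np
import pandas as pd

DEPT_COL = "DEPT_NM"  # assumed column name; verify against df.columns
digits = np.arange(1, 10)
expected = pd.Series(np.log10(1 + 1 / digits), index=digits)

# Made-up stand-in rows for a single department
df = pd.DataFrame({
    DEPT_COL: ["Dept B"] * 6,
    "AMOUNT": [550.0, 550.0, 550.0, 550.0, 12.5, 87.0],
})
df["FIRST_DIGIT"] = df["AMOUNT"].astype(str).str[0].astype(int)

dept_rows = df[df[DEPT_COL] == "Dept B"]
actual = (
    dept_rows["FIRST_DIGIT"]
    .value_counts(normalize=True)
    .reindex(digits, fill_value=0)
)

# Digit whose actual share exceeds the Benford share by the widest margin
gap = (actual - expected).sort_values(ascending=False)
worst_digit = gap.index[0]

# Most common amounts that start with that digit
top_amounts = (
    dept_rows.loc[dept_rows["FIRST_DIGIT"] == worst_digit, "AMOUNT"]
    .value_counts()
    .head(10)
)
print(worst_digit)
print(top_amounts)
```

A single amount repeated many times often points to a recurring fixed payment, which the other columns of the dataset may help identify.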