# Demo 5 - Wake County Expenditures

In this demo, we will look at Wake County expenditures, pulling data from SQL Server and performing some basic analysis.  Our goal here is to look for outliers in the transactions data set.

We will first install and load RODBC, which allows us to make connections to databases like SQL Server, along with the tidyverse, which we will use to manipulate the data.

In [None]:
install.packages("RODBC", repos = "http://cran.us.r-project.org")
install.packages("tidyverse", repos = "http://cran.us.r-project.org")

In [None]:
library(RODBC)
library(tidyverse)

We will connect to SQL Server using Windows authentication.  We could also use a pre-defined ODBC data source (DSN) instead.

In [None]:
conn <- odbcDriverConnect("driver={SQL Server};server=LOCALHOST;database=OutlierDetection;trusted_connection=true")
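If you would rather not hard-code the connection string, it can be assembled from its parts. The sketch below is one way to do that. `build_conn_string` is a hypothetical helper name (it is not part of RODBC), and `WakeDSN` is a made-up DSN used only for illustration. The code builds the string but does not open a connection by itself.

```r
# Hypothetical helper (not part of RODBC): assemble a SQL Server connection
# string that uses Windows authentication (trusted_connection=true).
build_conn_string <- function(server, database, driver = "SQL Server") {
  paste0(
    "driver={", driver, "};",
    "server=", server, ";",
    "database=", database, ";",
    "trusted_connection=true"
  )
}

conn_string <- build_conn_string("LOCALHOST", "OutlierDetection")
print(conn_string)

# With RODBC loaded and a reachable server, either of these opens a connection:
#   conn <- odbcDriverConnect(conn_string)
#   conn <- odbcConnect("WakeDSN")   # "WakeDSN" is an assumed, pre-defined DSN
# Call odbcClose(conn) when you are finished with the connection.
```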

The following query gets the expenditure line item name as well as the actual amount, rounded to the nearest dollar, from each transaction in the Wake County data set, keeping only rows where the rounded amount is greater than \$0.  Getting data at the expenditure line item name level is good enough for what we need to do, though you can try out other groupings on your own.

In [None]:
waketx <- sqlQuery(conn, "
SELECT
  ROUND(t.ActualAmount, 0) AS ActualAmount,
  eli.ExpenditureLineItemName
FROM Wake.WakeTransaction t
  INNER JOIN Wake.ExpenditureLineItem eli
    ON t.ExpenditureLineItemCode = eli.ExpenditureLineItemCode
WHERE
  ROUND(t.ActualAmount, 0) > 0
ORDER BY
  t.ActualAmount;"
)

First, let's look at the number of transactions by amount, looking for unexpected peaks.  Going back to the Wake County school board transportation fraud case, this kind of query might have exposed the fraud pretty quickly.

In [None]:
waketx.byAmount <- waketx  %>%
                      group_by(LogActualAmount = log(ActualAmount, 10)) %>%
                      summarize(n = n())
plot(waketx.byAmount)
lines(waketx.byAmount)
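To see what this kind of count looks like when the answer is known in advance, here is a minimal sketch on synthetic data (not the Wake County data). The data frame `toy` and its artificial pile-up at \$35 are assumptions made for illustration. `count()` is dplyr shorthand for the `group_by()` plus `summarize(n = n())` pattern used above.

```r
library(dplyr)

# Synthetic stand-in for waketx: a spread of whole-dollar amounts plus an
# artificial pile-up of 80 transactions at exactly $35.
set.seed(42)
toy <- data.frame(
  ActualAmount = c(round(rexp(500, rate = 1 / 200)) + 1, rep(35, 80))
)

# Count transactions per amount and put the most frequent amounts first.
toy.byAmount <- toy %>%
  count(ActualAmount, name = "n") %>%
  arrange(desc(n))

print(head(toy.byAmount, 3))
```

The planted amount should rise to the top of the table, which is the same signal we are looking for in the real data.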

We can see several peaks, including one big one at a pretty low value (under \$10) and another more than \$10 but less than \$100.

In [None]:
waketx.byAmount %>%
  mutate(ActualAmount = 10^LogActualAmount) %>%
  arrange(desc(n))
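Because `LogActualAmount` is a floating-point value, `10^LogActualAmount` is not guaranteed to come back as an exact whole number, which matters if you later compare it with `==`. A small sketch of rounding the back-transformed value, using the whole-dollar amounts discussed in this demo as sample input:

```r
amounts <- c(2, 8, 35, 165)          # sample whole-dollar amounts
log.amounts <- log(amounts, 10)      # same transformation as in the cell above
recovered <- round(10^log.amounts)   # rounding removes any tiny floating-point drift
print(recovered)
```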

Sorting the results by count, we see the spikes at \$2 and \$35, followed by \$165 and \$8.  What is so significant about \$35?

In [None]:
waketx %>%
  filter(ActualAmount == 35) %>%
  group_by(ExpenditureLineItemName) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(5)

Inmate pay accounts for the vast majority of \$35 outlays.  Is this the maximum amount an inmate might get paid?

In [None]:
inmate.pay <- waketx %>%
                filter(ExpenditureLineItemName == "INMATE PAY-SHERIFF DEPT ONLY") %>%
                group_by(ActualAmount) %>%
                summarize(n = n()) %>%
                arrange(desc(n))
inmate.pay %>% head(8)

In [None]:
plot(inmate.pay)

\$35 clearly is not the maximum inmate payment, although there is a big spike at that amount.

Moving on, what about those \$2 expenditures?  Those seem kind of weird.

In [None]:
waketx %>%
  filter(ActualAmount == 2) %>%
  group_by(ExpenditureLineItemName) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(8)
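We have now run the same "filter on one amount, then count by line item" pattern twice, so it could be wrapped in a function. The following is only a sketch: `top_line_items` is a hypothetical helper name, and the `toy` data frame with its placeholder line item names is synthetic. With the real data you would pass `waketx` instead.

```r
library(dplyr)

# Hypothetical helper: the k most common line items for one dollar amount.
top_line_items <- function(df, amount, k = 5) {
  df %>%
    filter(ActualAmount == amount) %>%
    group_by(ExpenditureLineItemName) %>%
    summarize(n = n()) %>%
    arrange(desc(n)) %>%
    head(k)
}

# Synthetic rows for illustration only.
toy <- data.frame(
  ActualAmount = c(2, 2, 2, 35, 35, 2),
  ExpenditureLineItemName = c("ITEM A", "ITEM A", "ITEM B", "ITEM C", "ITEM C", "ITEM A")
)

result <- top_line_items(toy, 2)
print(result)
```

Calling `top_line_items(waketx, 35)` or `top_line_items(waketx, 2, k = 8)` would then mirror the two cells above.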

DISABILITY is the most common, followed by county-paid life insurance.  Let's look at the distribution of DISABILITY transaction amounts.

In [None]:
disability <- waketx %>%
  filter(ExpenditureLineItemName == "DISABILITY") %>%
  group_by(ActualAmount) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) 
disability %>% head(8)

The most common amounts tend to be somewhere between \$2 and \$10.  These seem like petty cash expenditures.  But let's see if there are big-ticket items under this same category.

In [None]:
plot(x = log(disability$ActualAmount, 10), y = disability$n)

This is a base-10 log, so 2.0 = \$100 and 3.0 = \$1000.  As we noted, the most common amounts are well under \$10 (1.0), but there are some higher-priced payouts.
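As a quick reference for reading the x-axis, this sketch tabulates a few dollar amounts next to their base-10 logs. The amounts chosen are just sample points.

```r
dollars <- c(1, 2, 10, 35, 100, 1000)

# Each whole step on the log scale is a tenfold increase in dollars.
log.scale <- data.frame(
  dollars     = dollars,
  log.dollars = round(log10(dollars), 2)
)
print(log.scale)
```

Anything plotted to the left of 1.0 is therefore under \$10, and anything to the right of 2.0 is over \$100.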

In [None]:
disability %>%
  arrange(desc(ActualAmount)) %>%
  head(15)

It's interesting that 14 of the 15 top amounts are unique, each occurring only once.
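Reading "unique" as an amount that was paid exactly once (`n == 1`), that count can be computed rather than read off the table. The sketch below uses a small synthetic summary, `toy.disability`, with made-up amounts and counts. With the real data you would start from `disability` and use `head(15)`.

```r
library(dplyr)

# Synthetic stand-in for the disability summary: one row per amount.
toy.disability <- data.frame(
  ActualAmount = c(2, 5, 8, 400, 650, 900),
  n            = c(120, 60, 45, 1, 2, 1)
)

# Take the largest amounts, as in the cell above.
top.amounts <- toy.disability %>%
  arrange(desc(ActualAmount)) %>%
  head(3)

# How many of the largest amounts were paid exactly once?
once <- sum(top.amounts$n == 1)
print(once)
```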