# Demo 03 - Duplicates and Cardinality

In the first notebook, we saw some fishy behavior: expenses rose sharply in 2018 compared with prior years.  Now let's use a couple of the tools at our disposal to delve further into the problem.

In [1]:
if(!require(tidyverse)) {
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    library(tidyverse)
}

if(!require(odbc)) {
    install.packages("odbc", repos = "http://cran.us.r-project.org")
    library(odbc)
}

if(!require(data.table)) {
  install.packages("data.table", repos = "http://cran.us.r-project.org")
  library(data.table)
}

Loading required package: tidyverse
"package 'tidyverse' was built under R version 3.5.2"
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.0     v purrr   0.2.5
v tibble  1.4.2     v dplyr   0.7.6
v tidyr   0.8.1     v stringr 1.3.1
v readr   1.1.1     v forcats 0.3.0
"package 'ggplot2' was built under R version 3.5.2"
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Loading required package: odbc
"package 'odbc' was built under R version 3.5.3"
Loading required package: data.table

Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose



## Count of Invoices by Day

The first thing we want to do is see how many invoices we get per bus, vendor, and day.  Our organization requires that vendors send one invoice per bus maintenance item, so we generally expect no more than one invoice per combination of bus, vendor, and day.  After all, invoices can get lost in the shuffle, and creating more than one is inefficient.

In [2]:
conn <- DBI::dbConnect(odbc::odbc(), 
                      Driver = "SQL Server", 
                      Server = "localhost", 
                      Database = "ForensicAccounting", 
                      Trusted_Connection = "True")

In [3]:
dupes <- DBI::dbGetQuery(conn, "WITH records AS
(
	SELECT
		li.LineItemDate,
		li.BusID,
		li.VendorID,
		COUNT(*) AS NumberOfInvoices
	FROM dbo.LineItem li
	GROUP BY
		li.LineItemDate,
		li.BusID,
		li.VendorID
)
SELECT
	NumberOfInvoices,
	COUNT(*) AS NumberOfOccurrences
FROM records
GROUP BY
	NumberOfInvoices
ORDER BY
	NumberOfInvoices;")

In [4]:
dupes

NumberOfInvoices,NumberOfOccurrences
1,37224
2,118
3,1
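
The SQL above aggregates twice: first it counts invoices per date, bus, and vendor, then it counts how often each invoice count occurs.  For readers who prefer to stay in R, here is a minimal dplyr sketch of the same two-stage pattern.  The `toyLineItems` data frame and its values are made up for illustration and only borrow column names from `dbo.LineItem`.

```r
library(dplyr)

# Hypothetical toy data standing in for dbo.LineItem; the values are made up.
toyLineItems <- data.frame(
  LineItemDate = c("2018-01-02", "2018-01-02", "2018-01-02",
                   "2018-01-03", "2018-01-03", "2018-01-03", "2018-01-04"),
  BusID        = c(1, 1, 2, 1, 1, 1, 3),
  VendorID     = c(5, 5, 5, 9, 9, 9, 2),
  stringsAsFactors = FALSE
)

# Stage 1: number of invoices per (date, bus, vendor) combination.
perGroup <- toyLineItems %>%
  group_by(LineItemDate, BusID, VendorID) %>%
  summarise(NumberOfInvoices = n()) %>%
  ungroup()

# Stage 2: how many combinations share each invoice count.
toyDupes <- perGroup %>%
  group_by(NumberOfInvoices) %>%
  summarise(NumberOfOccurrences = n())

toyDupes
```

In the toy data, two combinations have one invoice, one has two, and one has three, so `toyDupes` has the same shape as the `dupes` result above.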


The vast majority of the time, we see one invoice per bus, vendor, and day.  There are 118 cases with two invoices and a single case with three invoices on the same day.  It might be interesting to see who's sending us multiple invoices, so let's do that.

In [5]:
dupeSenders <- DBI::dbGetQuery(conn, "WITH records AS
(
	SELECT
		li.LineItemDate,
		li.BusID,
		li.VendorID,
		COUNT(*) AS NumberOfInvoices
	FROM dbo.LineItem li
	GROUP BY
		li.LineItemDate,
		li.BusID,
		li.VendorID
)
SELECT
	VendorID,
	COUNT(*) AS NumberOfOccurrences
FROM records
WHERE
	NumberOfInvoices > 1
GROUP BY
	VendorID
ORDER BY
	VendorID;")

In [6]:
dupeSenders

VendorID,NumberOfOccurrences
2,11
5,59
6,1
7,2
8,4
9,16
10,4
11,1
12,6
13,6


Vendors 2, 5, and 9 have double-digit counts of multiple-invoice days, and vendor 5 has nearly four times as many as the next-highest vendor (59 versus 16).  That's a little weird and worth keeping in the back of our minds, but it's not outlandish.  So let's keep digging.

## Cardinality Checks

We can use the `rapply()` function to perform cardinality checks, showing us how many distinct values there are in our data set.

In [7]:
lineItems <- DBI::dbGetQuery(conn, "SELECT
	*
FROM dbo.LineItem li;")

In [8]:
rapply(lineItems, function(x) { length(unique(x)) })

We knew that there were 15 vendors, 28 expense categories, and 12 employees, so those aren't surprising.  We do see 664 buses which have had maintenance of some sort done on them.  That means 36 buses were retired without ever having gone through maintenance.
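
To make the mechanics of the cardinality check concrete, here is a small sketch on a made-up data frame.  The `toy` object and its values are assumptions for illustration.  Because a data frame is a list of columns, `rapply()` applies the function to each column in turn, and `sapply()` is a common alternative that gives the same counts.

```r
# Hypothetical toy data; column names mirror the notebook, values are made up.
toy <- data.frame(
  VendorID   = c(1, 1, 2, 5, 5, 5),
  EmployeeID = c(4, 8, 4, 8, 10, 12),
  Amount     = c(120.50, 999.99, 45.00, 999.99, 999.99, 310.25)
)

# Count the distinct values in each column.
cardinality <- rapply(toy, function(x) length(unique(x)))

# The same check written with sapply().
cardinalityAlt <- sapply(toy, function(x) length(unique(x)))

cardinality
```

Here the toy data has 3 distinct vendors, 4 distinct employees, and 4 distinct amounts.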

Cardinality is also useful when looking at subsets of data.  For example, let's filter to include just invoices valued between \$850 and \$999.99, as these are high-value invoices which fall below the two-signer rule.

In [9]:
highValueInvoices <- lineItems %>% dplyr::filter(Amount >= 850 & Amount < 1000)

In [10]:
rapply(highValueInvoices, function(x) { length(unique(x)) })

It looks like 12 of our 15 vendors have sent invoices between \$850 and \$999.99.  We can dig deeper using the `setDT()` function in the `data.table` package.  Let's look at counts by vendor ID:

In [11]:
data.table::setDT(highValueInvoices)[, .N, keyby=VendorID]

VendorID,N
1,75
2,22
5,525
6,48
7,46
8,42
9,12
11,72
12,10
13,11
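
For readers less familiar with `data.table` syntax, the sketch below breaks the one-liner into its parts on made-up data.  The `toyInvoices` object is an assumption for illustration.  `setDT()` converts a data frame to a `data.table` by reference (it modifies the object rather than copying it), `.N` is the number of rows in each group, and `keyby` groups by the given column and sorts the result by it.

```r
library(data.table)

# Hypothetical toy data; the values are made up.
toyInvoices <- data.frame(
  VendorID = c(5, 1, 5, 2, 5, 1),
  Amount   = c(999.99, 870.00, 999.99, 910.10, 996.06, 955.00)
)

# Convert in place; toyInvoices is a data.table after this call.
setDT(toyInvoices)

# Row count per vendor, sorted by VendorID.
countsByVendor <- toyInvoices[, .N, keyby = VendorID]

countsByVendor
```

Because the conversion happens by reference, `highValueInvoices` in the notebook is itself a `data.table` after the first call to `setDT()`.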


Vendor 5 keeps coming up as an outlier.  Maybe they are special, but if your spidey-senses are tingling, I don't blame you.

Maybe they just have a lot of high-value items, so let's see how many invoices over \$1000 they have in our data set.

In [12]:
data.table::setDT(filter(lineItems, Amount > 1000))[, .N, keyby=VendorID]

VendorID,N
1,171
2,7
6,37
7,105
8,101
9,6
11,161
12,7
13,11
14,157


That's strange.  Vendor 5 has no invoices over \$1000.  How about ones which are just under \$1000?  Here we will focus specifically on vendor 5 and look at amounts greater than \$995.

In [13]:
data.table::setDT(filter(lineItems, VendorID == 5 & Amount > 995))[, .N, keyby=Amount]

Amount,N
996.06,1
997.25,1
997.43,1
999.29,1
999.99,411


415 invoices all happen to be within \$5 of our two-signer cutoff?  And 411 of those happen to be one penny short?  The circumstantial evidence is starting to add up.  We don't have anything conclusive yet, but this is looking very suspicious.
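
One way to generalize this check is to compute, for every vendor, the share of invoices that land just under the cutoff.  The sketch below is only an illustration under stated assumptions: `nearCutoffShare()` is a hypothetical helper, the toy data is made up, and the window is inclusive at its lower edge, whereas the query above used a strict `Amount > 995`.

```r
# Hypothetical helper: per vendor, count invoices in [cutoff - window, cutoff)
# and express that as a share of the vendor's invoices.
nearCutoffShare <- function(df, cutoff = 1000, window = 5) {
  nearCutoff <- df$Amount >= cutoff - window & df$Amount < cutoff
  total <- tapply(df$Amount, df$VendorID, length)
  near  <- tapply(nearCutoff, df$VendorID, sum)
  data.frame(
    VendorID   = as.numeric(names(total)),
    Total      = as.vector(total),
    NearCutoff = as.vector(near),
    Share      = as.vector(near / total)
  )
}

# Hypothetical toy data; the values are made up.
toyAmounts <- data.frame(
  VendorID = c(1, 1, 1, 1, 5, 5, 5, 5),
  Amount   = c(120.00, 870.00, 999.50, 1500.00, 999.99, 999.99, 999.99, 400.00)
)

shares <- nearCutoffShare(toyAmounts)
shares
```

A vendor whose share is far above everyone else's, as with the made-up vendor 5 here, is a candidate for a closer look rather than proof of anything.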

Now let's pivot and look at high-value invoices by employee.

In [14]:
data.table::setDT(highValueInvoices)[, .N, keyby=EmployeeID]

EmployeeID,N
1,42
2,37
3,35
4,146
5,38
6,46
7,36
8,155
9,51
10,184


All twelve of our employees have dealt with high-value invoices.  Let's see what it looks like when we filter on the suspicious vendor.

In [15]:
data.table::setDT(filter(lineItems, VendorID == 5 & Amount > 995))[, .N, keyby=EmployeeID]

EmployeeID,N
4,80
8,104
10,123
12,108


Only four employees handled those invoices for vendor 5.  But maybe the agency has people focus on certain sets of vendors.  Let's limit ourselves to the year 2018 and see how many invoices for vendor 5 each employee has handled.

In [16]:
data.table::setDT(filter(lineItems, VendorID == 5 & year(LineItemDate) == 2018))[, .N, keyby=EmployeeID]

EmployeeID,N
1,24
2,22
3,21
4,610
5,25
6,21
7,26
8,666
9,19
10,667


All 12 employees have handled invoices for vendor 5.  Eight of the 12 have handled roughly two dozen apiece, but the remaining four have over 600 apiece.  That averages out to roughly two invoices per employee per day from a single vendor.  That's a lot of invoices!
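
The per-day figure can be sanity-checked with a quick calculation.  This sketch takes the three counts above 600 that are visible in the output and divides them by the number of calendar days and by the number of weekdays in 2018.  Treating every Monday through Friday as a working day is an assumption, since holidays are ignored.

```r
# All calendar days in 2018.
days2018 <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")

# wday is 0 for Sunday through 6 for Saturday, so 1:5 is Monday through Friday.
isWeekday <- as.POSIXlt(days2018)$wday %in% 1:5
weekdays2018 <- sum(isWeekday)

# Vendor 5 invoice counts for 2018, copied from the output above (by EmployeeID).
counts <- c(`4` = 610, `8` = 666, `10` = 667)

perCalendarDay <- counts / length(days2018)
perWeekday     <- counts / weekdays2018

round(perCalendarDay, 2)
round(perWeekday, 2)
```

With 365 calendar days and 261 weekdays in 2018, these employees handled about 1.7 to 1.8 vendor 5 invoices per calendar day, or about 2.3 to 2.6 per weekday.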