# Demo 06 - Round Numbers

In this demo, we will perform round number analysis on our data set.

For this demo, we will use pyodbc and ipython-sql.  pyodbc is an ODBC driver for Python, whereas ipython-sql allows you to use "sql magic" in Jupyter.  You can just as easily run the queries in SQL Server Management Studio if you prefer.

First, let's use pip to install pyodbc and ipython-sql, then load each of them.

In [None]:
!pip install pyodbc

To load pyodbc, we can use the **import** statement.

In [None]:
import pyodbc

In [None]:
!pip install ipython-sql

To use SQL magic, we will need to run the following load command.

In [None]:
%load_ext sql

From here on out, I can use the *%sql* command to run a single-line SQL command.  I can also use the *%%sql* command to run multi-line SQL commands.

The first thing I want to do is connect to the OutlierDetection database.  I have already created an ODBC connection pointing to localhost.OutlierDetection.  You do not need to use a pre-defined ODBC connection, but when connecting to SQL Server, I've found it easier to use one.

In [None]:
%sql mssql+pyodbc://ForensicAccounting
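
If you would rather not pre-define an ODBC connection, one option is to build a SQLAlchemy-style URL yourself.  The sketch below is only an illustration: the helper name `build_mssql_url`, the driver name "ODBC Driver 17 for SQL Server", and the use of Windows authentication are all assumptions, so adjust them to whatever is installed on your machine.

```python
from urllib.parse import quote_plus


def build_mssql_url(server, database, driver="ODBC Driver 17 for SQL Server"):
    """Build a DSN-less mssql+pyodbc URL (hypothetical helper, not part of any library)."""
    # Assumption: Windows authentication (Trusted_Connection) is acceptable.
    odbc = (
        f"DRIVER={{{driver}}};"
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
    )
    # SQLAlchemy accepts a URL-encoded ODBC string in the odbc_connect parameter.
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


url = build_mssql_url("localhost", "OutlierDetection")
print(url)
```

The resulting string could then be passed to *%sql* in place of the DSN-based URL above.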

Round number analysis looks at the number of trailing zeroes before the decimal point.  The idea here is that people might be rounding off values and pocketing the remainder, so a bill of \$41.08 might be rounded up to \$50.

We will break down transactions into types:  type 0, 1, 2, 3, and 4+.  A type 0 has zero trailing 0s, whereas a 4+ would have at least four trailing 0s.

**Examples:**

\$58 is a type 0.

\$108 is a type 0.

\$110 is a type 1.

\$34,000 is a type 3.
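
To make the type definitions concrete, here is a minimal Python sketch of the same classification.  The function name `round_number_type` and the `cap` parameter are illustrative choices of mine, not something the queries in this demo define.

```python
def round_number_type(amount: int, cap: int = 4) -> int:
    """Count trailing zeroes in a positive whole-dollar amount, capped at `cap` (the 4+ bucket)."""
    if amount <= 0:
        raise ValueError("amount must be a positive whole number")
    zeroes = 0
    # Strip one trailing zero at a time until we hit a non-zero digit or the cap.
    while zeroes < cap and amount % 10 == 0:
        amount //= 10
        zeroes += 1
    return zeroes


for value in (58, 108, 110, 34_000):
    print(value, "-> type", round_number_type(value))
```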

This particular query uses the CROSS APPLY operator to make the query a bit easier to understand.

In [None]:
%%sql
WITH records AS
(
	SELECT
		v.VendorName,
		a.RoundedAmount
	FROM dbo.LineItem li
		INNER JOIN dbo.Vendor v
			ON li.VendorID = v.VendorID
		CROSS APPLY
		(
			SELECT
				ROUND(li.Amount, 0) AS RoundedAmount
		) a
	WHERE
		a.RoundedAmount > 0
)
SELECT
	r.VendorName,
	SUM(t4.IsType4) AS Type4,
	SUM(t3.IsType3) AS Type3,
	SUM(t2.IsType2) AS Type2,
	SUM(t1.IsType1) AS Type1,
	SUM(t0.IsType0) AS Type0,
	COUNT(1) AS NumberOfInvoices,
	CAST(100.0 * SUM(t0.IsType0) / COUNT(1) AS DECIMAL(5,2)) AS PercentType0
FROM records r
	CROSS APPLY(SELECT CASE WHEN r.RoundedAmount % 10000 = 0 THEN 1 ELSE 0 END AS IsType4) t4
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND r.RoundedAmount % 1000 = 0 THEN 1 ELSE 0 END AS IsType3) t3
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND r.RoundedAmount % 100 = 0 THEN 1 ELSE 0 END AS IsType2) t2
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND t2.IsType2 = 0 AND r.RoundedAmount % 10 = 0 THEN 1 ELSE 0 END AS IsType1) t1
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND t2.IsType2 = 0 AND t1.IsType1 = 0 THEN 1 ELSE 0 END AS IsType0) t0
GROUP BY
	r.VendorName
ORDER BY
	PercentType0 DESC;

What we are doing here is rounding values first and then calculating the percent of values meeting each type criterion.  Glass and Sons has a large number of \$999.99 records.  Those turn to \$1000 after rounding, which explains the bevy of Type 3s.
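
The round-then-classify step can be sketched in Python as follows.  This is an approximation under stated assumptions: I use `ROUND_HALF_UP` from the `decimal` module to mimic how `ROUND(x, 0)` treats positive decimal amounts, and the helper names `round_to_whole` and `trailing_zeroes` are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_to_whole(amount: str) -> int:
    """Round a positive amount to the nearest whole dollar, with ties going up."""
    return int(Decimal(amount).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def trailing_zeroes(value: int, cap: int = 4) -> int:
    """Count trailing zeroes in a positive integer, capped at `cap`."""
    count = 0
    while count < cap and value > 0 and value % 10 == 0:
        value //= 10
        count += 1
    return count


rounded = round_to_whole("999.99")
print(rounded, "-> type", trailing_zeroes(rounded))
```

A \$999.99 line item has no trailing zeroes as written, but once rounded to a whole dollar it lands in the Type 3 bucket.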

This is sorted by the percent of records with no round numbers at the end.  In a realistic data set, there is a natural spread, and sometimes you will see "big round numbers" like we represent with Type 3 or Type 4.  The only batch of big round numbers comes from Glass and Sons, but we already have reason to be suspicious of their data.

Next up, let's look at the high-level stats across all vendors.

In [None]:
%%sql
WITH records AS
(
	SELECT
		v.VendorName,
		a.RoundedAmount
	FROM dbo.LineItem li
		INNER JOIN dbo.Vendor v
			ON li.VendorID = v.VendorID
		CROSS APPLY
		(
			SELECT
				ROUND(li.Amount, 0) AS RoundedAmount
		) a
	WHERE
		a.RoundedAmount > 0
)
SELECT
	SUM(t4.IsType4) AS Type4,
	SUM(t3.IsType3) AS Type3,
	SUM(t2.IsType2) AS Type2,
	SUM(t1.IsType1) AS Type1,
	SUM(t0.IsType0) AS Type0,
	COUNT(1) AS NumberOfInvoices,
	CAST(100.0 * SUM(t0.IsType0) / COUNT(1) AS DECIMAL(5,2)) AS PercentType0
FROM records r
	CROSS APPLY(SELECT CASE WHEN r.RoundedAmount % 10000 = 0 THEN 1 ELSE 0 END AS IsType4) t4
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND r.RoundedAmount % 1000 = 0 THEN 1 ELSE 0 END AS IsType3) t3
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND r.RoundedAmount % 100 = 0 THEN 1 ELSE 0 END AS IsType2) t2
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND t2.IsType2 = 0 AND r.RoundedAmount % 10 = 0 THEN 1 ELSE 0 END AS IsType1) t1
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND t2.IsType2 = 0 AND t1.IsType1 = 0 THEN 1 ELSE 0 END AS IsType0) t0;

The percentage of non-type 0 records (that is, records whose last digit is zero) is roughly 14%.  If last digits followed a true uniform distribution, we'd expect 10%, so our figure is a little higher than we'd normally expect.
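
For reference, the uniform-distribution baseline can be worked out for every type.  The sketch below assumes each trailing digit is independently and uniformly distributed, which is an idealization; `expected_share` is a hypothetical helper name.

```python
from fractions import Fraction


def expected_share(type_index: int, cap: int = 4) -> Fraction:
    """Expected share of amounts with exactly `type_index` trailing zeroes under uniform digits.

    The last bucket (`cap`) means "at least `cap` trailing zeroes".
    """
    if type_index >= cap:
        return Fraction(1, 10 ** cap)
    # P(at least k zeroes) minus P(at least k + 1 zeroes).
    return Fraction(1, 10 ** type_index) - Fraction(1, 10 ** (type_index + 1))


shares = {k: expected_share(k) for k in range(5)}
non_type0 = 1 - shares[0]

for k, share in shares.items():
    print(f"type {k}: {float(share):.2%}")
print(f"non-type 0: {float(non_type0):.2%}")
```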

The cause appears to come from the Type 3s, where we'd expect roughly 1/10^3 = 1/1,000 = 0.1% if digit endings were strictly uniform.  0.1% of 31,879 is approximately 32 transactions.  The fact that we have 1,449 lends more credence to there being something fishy.  Want another reason to be suspicious?  Over 8 years of data, here are all of the Type 3s.  See if you spot a pattern.

In [None]:
%%sql
SELECT
	v.VendorName,
	li.Amount,
	a.RoundedAmount
FROM dbo.LineItem li
	INNER JOIN dbo.Vendor v
		ON li.VendorID = v.VendorID
	CROSS APPLY
	(
		SELECT
			ROUND(li.Amount, 0) AS RoundedAmount
	) a
WHERE
	a.RoundedAmount > 0
	AND a.RoundedAmount % 1000 = 0
ORDER BY
	v.VendorName;
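
To put the Type 3 count in perspective, here is the arithmetic from the discussion above as a small Python sketch.  The totals are the ones quoted in the text; the function name `excess_ratio` is my own, and the 1-in-1,000 rate is the same uniform-ending assumption used above.

```python
def excess_ratio(observed: int, total: int, expected_rate: float) -> tuple[float, float]:
    """Return (expected count, observed / expected) for a given expected rate."""
    expected = total * expected_rate
    return expected, observed / expected


TOTAL_TRANSACTIONS = 31_879  # total transactions quoted in the text
OBSERVED_TYPE3 = 1_449       # Type 3 count quoted in the text
UNIFORM_RATE = 1 / 1_000     # assumption: strictly uniform digit endings

expected, ratio = excess_ratio(OBSERVED_TYPE3, TOTAL_TRANSACTIONS, UNIFORM_RATE)
print(f"expected about {expected:.0f}, observed {OBSERVED_TYPE3}, ratio {ratio:.1f}x")
```

Under these assumptions, the observed count is roughly 45 times what uniform endings would produce.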