# Demo 2 - Gaps

In this demo, we will look for gaps in data.  Gaps matter because they show us where individual data points might be missing.  If the set should be complete, such as check numbers or invoice numbers, a gap is a telling indicator that there might be an issue.

For this demo, we will use pyodbc and ipython-sql.  pyodbc is a Python module for connecting to databases over ODBC, whereas ipython-sql allows you to use "sql magic" in Jupyter.  You can just as easily run the queries in SQL Server Management Studio if you prefer.

First, let's use pip to install pyodbc and ipython-sql and then load them.

In [None]:
!pip install pyodbc

To load pyodbc, we can use the **import** statement.

In [None]:
import pyodbc

In [None]:
!pip install ipython-sql

To use SQL magic, we will need to run the following load command.

In [None]:
%load_ext sql

From here on out, I can use the *%sql* command to run a single-line SQL command.  I can also use the *%%sql* command to run multi-line SQL commands.

The first thing I want to do is connect to the OutlierDetection database.  I have already created an ODBC connection pointing to localhost.OutlierDetection.  You do not need to use a pre-defined ODBC connection, but when connecting to SQL Server, I've found it easier to use one.

In [None]:
%sql mssql+pyodbc://OutlierDetection
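
If you would rather not define an ODBC data source, SQLAlchemy also accepts a full ODBC connection string passed through the `odbc_connect` query parameter.  The following is a minimal sketch rather than the setup used in this demo.  It assumes Windows authentication and that a driver named "ODBC Driver 17 for SQL Server" is installed; both are assumptions, so substitute whatever applies to your machine.

```python
from urllib.parse import quote_plus

# Hypothetical connection details; adjust these for your environment.
odbc_settings = {
    "DRIVER": "{ODBC Driver 17 for SQL Server}",  # assumed driver name
    "SERVER": "localhost",
    "DATABASE": "OutlierDetection",
    "Trusted_Connection": "yes",  # assumes Windows authentication
}

# Build the raw ODBC connection string, then URL-encode it for SQLAlchemy.
odbc_connection_string = ";".join(
    f"{key}={value}" for key, value in odbc_settings.items()
)
connection_url = "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_connection_string)

print(connection_url)
```

Pasting the printed URL after *%sql* should then take the place of the pre-defined connection name.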

I am going to create a table called #Invoices.  This table will hold our invoices, with invoice number and amount.  Our assumption is that InvoiceNumber is an ever-increasing, contiguous integer value.  I will insert some records that do contain gaps.

The *DROP IF EXISTS* syntax was introduced in SQL Server 2016.  If you are using a version of SQL Server prior to 2016, you can remove this line and manually drop the table if need be (or re-write this to check for the table's existence first).

In [None]:
%%sql DROP TABLE IF EXISTS #Invoices;

CREATE TABLE #Invoices
(
	InvoiceNumber INT NOT NULL,
	InvoiceAmount DECIMAL(13, 2) NOT NULL
);

INSERT INTO #Invoices
(
	InvoiceNumber,
	InvoiceAmount
)
VALUES
	(1, 500),
	(2, 600),
	(3, 150),
	(5, 175),
	(8, 209),
	(9, 1305);

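For versions prior to SQL Server 2016, the existence check mentioned above is commonly written with `OBJECT_ID`.  The sketch below only holds the T-SQL in a Python string so that it can be printed and copied.  The variable name is hypothetical, and the statement has not been run against the demo database.

```python
# T-SQL for versions before SQL Server 2016: look the temp table up in tempdb
# and drop it only when it exists. OBJECT_ID returns NULL for a missing object.
drop_if_exists_legacy = (
    "IF OBJECT_ID('tempdb..#Invoices') IS NOT NULL\n"
    "\tDROP TABLE #Invoices;"
)

print(drop_if_exists_legacy)
```

The printed statement would replace the `DROP TABLE IF EXISTS #Invoices;` line at the top of the cell.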

Next up, we will run a query which looks for gaps in the data.  Note that this uses the LEAD window function, which Microsoft introduced in SQL Server 2012.  If you are using SQL Server 2008 R2 or earlier, you would need to use different syntax to solve the gaps and islands problem.

In [None]:
%%sql WITH C AS
(
	SELECT
		InvoiceNumber AS CurrentInvoiceNumber,
		LEAD(InvoiceNumber) OVER (ORDER BY InvoiceNumber) AS NextInvoiceNumber
	FROM #Invoices
)
SELECT
	CurrentInvoiceNumber + 1 AS rangestart,
	NextInvoiceNumber - 1 AS rangeend
FROM C
WHERE
	NextInvoiceNumber - CurrentInvoiceNumber > 1;

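To see what the query is doing, here is a rough Python equivalent run against the same six invoice numbers.  It is an illustration of the logic, not a replacement for the SQL; the function name `find_gaps` is made up for this sketch.

```python
# Invoice numbers from the sample data above.
invoice_numbers = [1, 2, 3, 5, 8, 9]


def find_gaps(numbers):
    """Return (range_start, range_end) pairs for each run of missing integers."""
    ordered = sorted(numbers)
    gaps = []
    # zip pairs each value with its successor, which is the role LEAD plays in the query.
    for current, following in zip(ordered, ordered[1:]):
        if following - current > 1:
            gaps.append((current + 1, following - 1))
    return gaps


gaps = find_gaps(invoice_numbers)
print(gaps)  # [(4, 4), (6, 7)]
```

Each tuple corresponds to one row of the query output: invoice 4 is missing on its own, and invoices 6 through 7 are missing as a range.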

As an alternative to the window function approach, you can join to a tally (or numbers) table and look for cases in which there is no matching result in the invoices table.

This query first builds up a tally table in memory, which is extremely fast for values up to around 10,000 rows.  After about 10,000 rows, it might make more sense to build a tally table and store it on disk.

In [None]:
%%sql WITH
	t1 AS (SELECT 1 N UNION ALL SELECT 1 N),
	t2 AS (SELECT 1 N FROM t1 x, t1 y),
	t3 AS (SELECT 1 N FROM t2 x, t2 y),
	t4 AS (SELECT 1 N FROM t3 x, t3 y),
	Tally AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) N FROM t4 x, t4 y)
SELECT
	t.N
FROM Tally t
	LEFT OUTER JOIN #Invoices i
		ON t.N = i.InvoiceNumber
WHERE
	i.InvoiceNumber IS NULL
	AND t.N >= (SELECT MIN(InvoiceNumber) FROM #Invoices imin)
	AND t.N <= (SELECT MAX(InvoiceNumber) FROM #Invoices imax);
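
The tally approach can be sketched in Python as well, again using the sample invoice numbers.  Here `range` stands in for the tally table and a set lookup stands in for the outer join; this is an illustration of the idea under those substitutions, not a translation of how SQL Server executes the query.

```python
# Invoice numbers from the sample data above.
invoice_numbers = [1, 2, 3, 5, 8, 9]

# range plays the role of the tally table, limited to the observed minimum and maximum.
existing = set(invoice_numbers)
expected = range(min(invoice_numbers), max(invoice_numbers) + 1)

# Keep every expected number with no matching invoice (the LEFT JOIN ... IS NULL filter).
missing = [n for n in expected if n not in existing]
print(missing)  # [4, 6, 7]

# Row counts produced by each CTE: t1 has 2 rows, and each later CTE
# cross joins the previous one with itself, squaring the count.
rows = 2
for name in ("t2", "t3", "t4", "Tally"):
    rows = rows * rows
    print(name, rows)
```

Unlike the LEAD query, this returns one row per missing number rather than one row per range.  The loop at the end shows that the Tally CTE can produce up to 65,536 numbers, which is why it comfortably covers the range discussed above.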