# Clean Data in SQL:  Constraints and Normalization

In this notebook, we will look at a data set created from a publicly available CSV file.  We will then clean up this data set to make it easier for people to use.

First, let's take a look at the schema.


In [1]:
SELECT
    sc.column_id,
    sc.name,
    t.name AS DataType,
    sc.max_length
FROM sys.columns sc
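    INNER JOIN sys.types t
        -- Join on user_type_id:  several types share a system_type_id
        -- (e.g., nvarchar and sysname), which would return duplicate rows.
        ON sc.user_type_id = t.user_type_id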
WHERE
    sc.object_id = OBJECT_ID('dbo.Police')
ORDER BY
    sc.column_id;

<img src="https://raw.githubusercontent.com/feaselkl/TidyData/master/Images/PoliceSchema.png" width="1000" />



## Goals
1. Normalize the tables.  This means breaking out Incident Code (`LCR`) and Incident Description (`LCR DESC`) into their own table because the code functionally determines the description (`LCR` --> `LCR DESC`).
2. Create primary and unique keys to guarantee uniqueness on tables and prevent duplicate rows from interfering with analysis.
3. Fix names and data types to make analysis easier.
4. Create foreign key constraints to give people an idea of how to link data together.
5. Create check constraints to ensure that invalid data cannot sneak in.
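
The first goal relies on each `LCR` value mapping to exactly one `LCR DESC`. The query below is a minimal sketch of how one might check that assumption against `dbo.Police` before splitting the table; the `DescriptionCount` alias is only illustrative and is not part of the original notebook.

```sql
-- List any incident codes that carry more than one distinct description.
-- An empty result suggests that LCR --> LCR DESC holds in this data set.
SELECT
    p.LCR,
    COUNT(DISTINCT p.[LCR DESC]) AS DescriptionCount
FROM dbo.Police p
GROUP BY
    p.LCR
HAVING
    COUNT(DISTINCT p.[LCR DESC]) > 1;
```

If this returns rows, those codes would need to be reconciled first, because the primary key on the new code table allows only one description per code.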

### Create Incident Code

In [2]:
IF NOT EXISTS
(
    SELECT 1
    FROM sys.tables t
    WHERE
        t.name = N'IncidentCode'
)
BEGIN
    CREATE TABLE dbo.IncidentCode
    (
        IncidentCode VARCHAR(5) NOT NULL,
        IncidentDescription VARCHAR(55) NOT NULL
    );
    ALTER TABLE dbo.IncidentCode ADD CONSTRAINT [PK_IncidentCode]
        PRIMARY KEY CLUSTERED(IncidentCode);
END
GO

### Populate Incident Code

In [3]:
INSERT INTO dbo.IncidentCode
(
    IncidentCode,
    IncidentDescription
)
SELECT DISTINCT
    p.LCR,
    p.[LCR DESC]
FROM dbo.Police p
    LEFT OUTER JOIN dbo.IncidentCode ic
        ON p.LCR = ic.IncidentCode
WHERE
    ic.IncidentCode IS NULL;
GO

In [4]:
SELECT TOP(15)
    ic.IncidentCode,
    ic.IncidentDescription
FROM dbo.IncidentCode ic;

### Create a New Incident Table

This `Incident` table will replace the `Police` table as the "fact" table in our data set.  It records which incidents occurred, when they happened, and where they happened.

We add an `IncidentID` to give us an internal integer value for clustering, but we want to keep `INC NO` (now renamed to `IncidentNumber`) so we can tie back to our source system.

In [5]:
IF NOT EXISTS
(
	SELECT 1
	FROM sys.tables t
	WHERE
		t.name = N'Incident'
)
BEGIN
	CREATE TABLE dbo.Incident
	(
		IncidentID INT IDENTITY(1,1) NOT NULL,
		IncidentCode VARCHAR(5) NOT NULL,
		IncidentDate DATETIME,
		BeatID INT NOT NULL,
		IncidentNumber VARCHAR(19) NOT NULL,
		IncidentLocation GEOGRAPHY NULL
	);
	ALTER TABLE dbo.Incident ADD CONSTRAINT [PK_Incident]
		PRIMARY KEY CLUSTERED(IncidentID);
	ALTER TABLE dbo.Incident ADD CONSTRAINT [FK_Incident_IncidentCode]
		FOREIGN KEY(IncidentCode)
		REFERENCES dbo.IncidentCode(IncidentCode);
	ALTER TABLE dbo.Incident ADD CONSTRAINT [UKC_Incident]
		UNIQUE(IncidentNumber);
	ALTER TABLE dbo.Incident ADD CONSTRAINT [CK_Incident_BeatID]
		CHECK(BeatID >= 0);
	ALTER TABLE dbo.Incident ADD CONSTRAINT [CK_Incident_IncidentDate]
		CHECK(IncidentDate >= '2005-01-01' AND IncidentDate < '2015-01-01');
END
GO
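
To see the check constraints at work, one option is to attempt an insert that breaks a rule and observe the error. The sketch below assumes `dbo.IncidentCode` already contains at least one row; the incident number `'TEST-0001'` and the date are made-up values for illustration only. Everything runs inside a transaction that is rolled back, so no test data should remain.

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    -- A negative BeatID should violate CK_Incident_BeatID.
    -- 'TEST-0001' is a hypothetical incident number used only for this test.
    INSERT INTO dbo.Incident
    (
        IncidentCode,
        IncidentDate,
        BeatID,
        IncidentNumber,
        IncidentLocation
    )
    SELECT TOP(1)
        ic.IncidentCode,
        '20100601',
        -1,
        'TEST-0001',
        NULL
    FROM dbo.IncidentCode ic;

    -- Undo the insert if it unexpectedly succeeded (or inserted no rows).
    ROLLBACK TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;

    -- Show the constraint violation message.
    PRINT ERROR_MESSAGE();
END CATCH;
```

If `dbo.IncidentCode` is empty, the `SELECT` returns no rows and nothing is inserted, so no error appears.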

### Populate Incident

Not all incident dates are legitimate dates, so we will use `TRY_PARSE` to perform this conversion:  it returns `NULL` instead of raising an error when a value cannot be parsed.  Note that `TRY_PARSE` is quite slow in comparison to `TRY_CAST` or `TRY_CONVERT`, but its `USING 'en-us'` culture argument handles our US-formatted dates easily, and this is an operation we only need to do once.

For geographic conversions, we need to check that the location field actually has data in it before attempting to convert it to a `GEOGRAPHY` point; empty or missing locations are stored as `NULL`.

In [6]:
IF NOT EXISTS
(
    SELECT 1
    FROM dbo.Incident i
)
BEGIN
    INSERT INTO dbo.Incident
    (
        IncidentCode,
        IncidentDate,
        BeatID,
        IncidentNumber,
        IncidentLocation
    )
    SELECT
        p.LCR,
        TRY_PARSE(p.[INC DATETIME] AS DATETIME USING 'en-us') AS IncidentDate,
        p.BEAT,
        p.[INC NO],
        CASE
            WHEN NULLIF(p.[LOCATION], '') IS NULL THEN NULL
            ELSE GEOGRAPHY::STPointFromText('POINT' + REPLACE(p.[LOCATION], ',', ''), 4326)
        END AS IncidentLocation
    FROM dbo.Police p;
END
GO
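
Because unparseable dates and empty locations both end up as `NULL`, it can be useful to count how many rows were affected after the load. The following is a minimal sketch of such a check; the column aliases are illustrative names, not part of the original notebook.

```sql
-- Summarize how many loaded incidents are missing a date or a location.
SELECT
    COUNT(*) AS TotalIncidents,
    SUM(CASE WHEN i.IncidentDate IS NULL THEN 1 ELSE 0 END) AS MissingDates,
    SUM(CASE WHEN i.IncidentLocation IS NULL THEN 1 ELSE 0 END) AS MissingLocations
FROM dbo.Incident i;
```

A large `MissingDates` count relative to `TotalIncidents` would suggest looking at the raw `INC DATETIME` values for a format that `TRY_PARSE` does not recognize.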

## End Results

After these changes, the data set is much easier to understand.  Querying the data is straightforward, and the constraints ensure that rows violating our data rules cannot be inserted.  Below is a sample query of the cleaned-up data set.

In [7]:
SELECT TOP(10)
    i.IncidentID,
    i.IncidentCode,
    ic.IncidentDescription,
    i.IncidentDate,
    i.BeatID
FROM dbo.Incident i
    INNER JOIN dbo.IncidentCode ic
        ON i.IncidentCode = ic.IncidentCode;
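
As a further illustration of what the normalized design allows, the sketch below counts incidents per incident code and shows the ten most common. It uses only the tables and columns created above; the `NumberOfIncidents` alias is an illustrative name.

```sql
-- Ten most frequent incident types, with their descriptions.
SELECT TOP(10)
    ic.IncidentCode,
    ic.IncidentDescription,
    COUNT(*) AS NumberOfIncidents
FROM dbo.Incident i
    INNER JOIN dbo.IncidentCode ic
        ON i.IncidentCode = ic.IncidentCode
GROUP BY
    ic.IncidentCode,
    ic.IncidentDescription
ORDER BY
    NumberOfIncidents DESC;
```

Because the foreign key guarantees that every incident has a matching code, the inner join here should not drop any rows.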