snow-data-validation

Template for bespoke data validation using stored procedures and ANSI SQL on Snowflake

Overview

Snowflake customers use a variety of tools as part of a larger solution to build enterprise-grade, integrated systems that are automated and provide value to the business. Snowflake partners like Informatica, Matillion, dbt, Talend, and Qlik provide data quality and data validation components as part of their tools. In many cases these companies have built-in quality rules that check for data types, null values, proper encoding, and other generic conditions that can be applied across customers without prior knowledge of the data points. These powerful tools can be further customized with business-specific rules, providing bespoke validation and 100% coverage of data loaded into Snowflake. The tools are becoming cheaper and more powerful over time, but that doesn't always mean they are easy to deploy, manage, or procure. Some organizations also face a litany of compliance factors that stand in the way even after other details have been worked out. It is therefore not always feasible to adopt tools from the partner ecosystem, especially when Snowflake itself can be leveraged to solve the problem.

To help Snowflake customers think about best practices for implementing a solution without such tools, we have developed a couple of examples that may help with building or adapting these options for your own data warehouse. The Snowflake Professional Services team and systems integrators can assist with or implement processes like these, but code provided as a template or example can also help customers think through the process of implementing this pipeline or developing a strategy to support it. In traditional architectures, the limitations of compute and storage have forced teams to spend their resources managing those resources; with Snowflake these responsibilities disappear, allowing the business to focus on delivering results with the data.


Methods

The following methods for validating data each have their own advantages and disadvantages. Consider them as examples for building pipelines relevant to your own data assets.

Data Validation

In both of the following methods, data comes from an external source. There are three distinct routes to load it into Snowflake:

  • Data is extracted and loaded through an ETL/ELT tool or custom application (e.g. Python) and delivered to a staging table in Snowflake.
  • Data is loaded into a customer-managed blob store (i.e. AWS S3, Azure Blob/ADLS Gen2, or GCP GCS), and Snowflake is granted direct access to the relevant location through an External Stage.
  • SnowSQL or the WebUI (with a compressed 50 MB limit in the UI) is used to bulk load data from an on-prem source to a stage table.

In any of these cases the primary objective is to load data into a "staging" or "landing" table in Snowflake that collects all the data loaded into the platform, as sketched below.
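A minimal sketch of the external-stage route in Snowflake SQL; every object name here is hypothetical, not something defined by this repository:

```sql
-- All object names are illustrative; adjust to your own environment.
-- Credentials or a STORAGE INTEGRATION would normally be configured on the stage.
CREATE OR REPLACE STAGE landing_stage
  URL = 's3://my-company-bucket/incoming/'  -- customer-managed blob store
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');

-- The "staging"/"landing" table that collects everything loaded into Snowflake.
CREATE OR REPLACE TABLE trips_raw (
  tripduration INTEGER,
  starttime    STRING,  -- kept loose so malformed values still land and can be validated
  stoptime     STRING,
  birth_year   STRING
);

COPY INTO trips_raw
  FROM @landing_stage
  ON_ERROR = 'CONTINUE';  -- load what parses; inspect record-level rejects afterwards
```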

The VALIDATE function can surface record-level failures of past COPY commands. By default the COPY command will pass/fail a file that is loaded, but if the desire is to know which records failed the initial data load, then VALIDATE is a good first stop toward identifying them.
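For example, assuming the hypothetical trips_raw load above was the most recent COPY in the session:

```sql
-- Return the records rejected by the last COPY INTO executed in this session.
SELECT * FROM TABLE(VALIDATE(trips_raw, JOB_ID => '_last'));
```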

An External Table can be leveraged to preemptively access the data in a customer-managed blob store, where views and queries can validate the data before a data load is even attempted.
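A minimal sketch over the same hypothetical stage; note that for CSV sources an external table exposes each row through a single VARIANT column named VALUE, with fields c1, c2, and so on:

```sql
CREATE OR REPLACE EXTERNAL TABLE ext_trips
  LOCATION = @landing_stage
  AUTO_REFRESH = FALSE
  FILE_FORMAT = (TYPE = CSV);

-- Inspect rows that would fail a rule BEFORE any COPY is attempted,
-- e.g. a second CSV column (c2) that does not parse as a timestamp.
SELECT VALUE
FROM ext_trips
WHERE TRY_TO_TIMESTAMP(VALUE:c2::STRING) IS NULL;
```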

Once this loading strategy is defined for the business, additional business-specific data rules can be applied. Again, this would be a great place for additional tooling to come into play, but if the goal is to simplify the architecture footprint, or there are constraints on procuring additional tools, consider the following methods as examples for moving forward.

Method 1 - Dynamic Common Table Expression (CTE) from Business Rules

Method 1 describes a stored procedure that generates a Common Table Expression (CTE) for each of the business rules defined in a Validation Rules table. First, a table is loaded with metadata about each business rule and the associated WHERE clause, which allows the process to determine which records fall outside the bounds defined by the rules. The example provided uses the Citibike Trips data leveraged in our online Zero to Snowflake labs. The data can be accessed via an External Stage pointing at this S3 bucket URL (s3://snowflake-workshop-lab/citibike-trips), as shown below.
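The stage itself can be created directly against that public bucket (the stage name is illustrative):

```sql
-- Public Zero to Snowflake lab data; no credentials required.
CREATE OR REPLACE STAGE citibike_trips
  URL = 's3://snowflake-workshop-lab/citibike-trips';
```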

The example business rules that we will enforce (sketched in SQL after the list) are:

  • Bike rides cannot exceed 24 hours
  • Riders cannot be older than 150 years
  • The starttime must be a valid date and time
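A minimal sketch of what the rules table and the generated statement might look like. The table name, columns, and the trips staging table are assumptions for illustration; the actual template in this repository may differ:

```sql
-- Hypothetical rules repository; one row per business rule.
CREATE OR REPLACE TABLE validation_rules (
  rule_id      INTEGER,
  rule_name    STRING,
  where_clause STRING,  -- predicate that selects FAILING rows
  start_date   DATE,
  end_date     DATE     -- NULL = active; set a date to deprecate the rule
);

INSERT INTO validation_rules VALUES
  (1, 'ride_over_24_hours',   'tripduration > 24 * 60 * 60',                            '2020-01-01', NULL),
  (2, 'rider_over_150_years', 'YEAR(CURRENT_DATE()) - TRY_TO_NUMBER(birth_year) > 150', '2020-01-01', NULL),
  (3, 'invalid_starttime',    'TRY_TO_TIMESTAMP(starttime) IS NULL',                    '2020-01-01', NULL);

-- The stored procedure stitches one CTE per active rule, producing a
-- statement shaped roughly like this (two rules shown for brevity):
WITH ride_over_24_hours AS (
  SELECT 'ride_over_24_hours' AS failed_rule, * FROM trips
  WHERE tripduration > 24 * 60 * 60
),
invalid_starttime AS (
  SELECT 'invalid_starttime' AS failed_rule, * FROM trips
  WHERE TRY_TO_TIMESTAMP(starttime) IS NULL
)
SELECT * FROM ride_over_24_hours
UNION ALL
SELECT * FROM invalid_starttime;
```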

Pros

All business rules are kept in a single table repository. This centralizes the rules and makes them easy to manage. Rules can be deprecated through a start/end date, allowing them to expire naturally or be retired manually.

Cons

Any time you create a dynamic process, the probability of error increases with complexity.

Method 2 - Declarative Pipeline through Concrete Views
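As the heading suggests, each business rule here is written out as a standing (concrete) view over the staging table, and a record fails a rule whenever the corresponding view returns it. A minimal sketch, reusing the hypothetical trips table (ref_db and its stations table are likewise made up for illustration):

```sql
-- One concrete view per business rule; names are illustrative.
CREATE OR REPLACE VIEW v_rule_ride_over_24_hours AS
  SELECT * FROM trips
  WHERE tripduration > 24 * 60 * 60;

CREATE OR REPLACE VIEW v_rule_invalid_starttime AS
  SELECT * FROM trips
  WHERE TRY_TO_TIMESTAMP(starttime) IS NULL;

-- Cross-database checks can be folded into a rule, e.g. validating
-- station ids against a reference table in another database.
CREATE OR REPLACE VIEW v_rule_unknown_station AS
  SELECT t.*
  FROM trips t
  LEFT JOIN ref_db.public.stations s
    ON t.start_station_id = s.station_id
  WHERE s.station_id IS NULL;
```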

Pros

Complicated business rules can be defined in more robust SQL. Cross-database checks can be leveraged to include other data domains or lookup tables.

Cons

If not well managed, the views can create sprawl that makes them difficult to understand or govern from a broader perspective. Managing the lifecycle of a view is a manual or external process based on metadata.

Example Data and Rules

Maintainers

Gabriel Mullen is a Senior Sales Engineer with Snowflake for the State, Local, and Education Team

Brock Cooper is a Senior Sales Engineer with Snowflake for the State, Local, and Education Team

This is a community-developed script, not an official Snowflake offering. It comes with no support or warranty. However, feel free to raise a GitHub issue if you find a bug or would like a new feature.

Legal

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this tool except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
