Skip to content
aquilax1 edited this page Nov 10, 2017 · 1 revision

Introduction

A good database design is a fundamental requisite for achieving good data quality, thus it is important to put effort into the design of a database. However, maintaining the quality of an evolving database schema is a difficult challenge. If the schema is maintained by different developers, and there are no clear agreements on design principles, the quality of the schema design may degenerate over time. Possible issues that arise throughout development are missing foreign keys, different data types between columns in foreign keys, and redundant indices. As the schema size increases, the challenge of ensuring the quality by manual reviewing becomes both time-consuming and error-prone.

DBLint is an automated and DBMS independent tool for analyzing database designs. The main purpose of DBLint is to ensure consistency and maintainability of database designs by identifying bad design patterns. These patterns are expressed as rules that analyze a schema's metadata and the data stored in the database. DBLint's main output is an interactive report containing a list of all the discovered design issues.

To illustrate many of the pitfalls facing a database designer, consider the MySQL schema in the code example below containing the two tables users and posts.

create table users (
    user_name          varchar(32),
    id                 varchar(32) primary key,
    email1             varchar(32),
    email2             char(32) default '',
    last_visited_post  int,
    `msn`              varchar(0));

create table posts (
    date               datetime,
    post_id            int unique,
    subject            varchar(31),
    postText           varchar(1500),
    user_id            varchar(10),
    index ix_id (post_id),
    index ix_id_subject (post_id, subject),
    foreign key (user_id) references users(id));
   
alter table users add foreign key (last_visited_post) references posts(post_id);

The DBLint tool detects 15 design issues in this example, several non-obvious. The issues reported are the following, grouped by severity.

Critical: A table without a primary key, a varchar column of length zero, and different data types between a source and a target column.

High: A char column with the empty string as default value, and columns not following the naming convention.

Medium: Different data types in a sequence of columns (email1 and email2), inconsistent maximum length of varchar columns, a redundant index, too many nullable columns in the same table, and a nullable and unique column.

Low: An identifier containing a special character, a varchar column with too large maximum length, a cyclic dependency between the two tables, a column named with a reserved word from SQL, and a primary-key column not positioned first.

DBLint is envisioned to be used by developers to quickly catch common design errors, e.g., as part of a unit-test suite. Analyzing schema metadata is very fast and can be performed quickly to verify the design before deployment. A developer team can control the quality of a schema by expressing their design principles as rules in DBLint. Furthermore, DBLint can be used in a training environment to help newcomers to understand and avoid common design errors. DBLint's data analysis can be used on deployed databases to find data quality issues that arise when the system is used.

The three main principles of DBLint are: (1) Low configuration, which ensures a low initial cost of using DBLint. (2) Domain-independence, which ensures general applicability. (3) Rule extensibility, which enables DBLint to be adapted to specialized environments. An example of a low-configuration and domain-independent rule is the naming-convention checker that automatically discover the convention used and detects inconsistencies.

We introduce a metric score that summarizes the overall quality of a schema. The score is based on the design issues reported and a table-importance measure. The score can be used to compare different schema versions or design alternatives. DBLint also assigns a score to each table, such that the most problematic parts of the schema can be easily identified.

Clone this wiki locally