-
Notifications
You must be signed in to change notification settings - Fork 1
System Overview
DBLint has a layered architecture, shown in Figure 1, that divides the system's responsibilities into four low-coupled layers. The result is a flexible and maintainable architecture without ties to any concrete DBMS. The architecture is a non-strict layered architectures in that the UI Layer uses components from all three layers below it.
DBLint supports four common database systems, with minimal DBMS-specific code. New DBMSs can be supported by implementing an interface, effectively encapsulating all DBMS specific code. The rule system uses a plug-in architecture that decouples the rules from the rest of the system. This decoupling makes the system highly extensible and maintainable.
The architecture of DBLint is divided into two parts: DBLint and Data Sources. Data sources are external databases from which the Extractor retrieves the metadata and the data, which rules are checked against.
The boxes in DBLint on the figure represents separate layers. Each layer is further divided into sub-components such as Controller and Model Builder. As shown, rules are not considered a part of the DBLint core. Instead, the rules are loaded at run-time, represented by an arrow on the figure. The following sections describe the layers in more detail.
Figure 1: Architecture
The Extraction Layer's main responsibility is connecting to a database, extracting metadata, and allowing data access. The Extraction Layer is a generic database interface that allows DBLint to process databases from different DBMSs in a uniform way. It handles all differences between different DBMS systems, such as loading DBMS specific drivers and accessing metadata.
The Model Layer consists of the two components: Model and Model Builder. The Model component is a collection of classes used to represent, for example, tables, columns, and indices. The Model Builder instantiates these classes using metadata from the Extraction Layer, thereby constructing an in-memory database model. This model is the main data structure used when expressing rules. The model is a fast and convenient way of accessing the database structure while being completely DBMS independent.
Rules and providers are decoupled from DBLint and written as plug-ins, increasing the maintainability and flexibility of the architecture. The definitions of rules and providers are the following.
Rule Definition: A rule is an independent piece of code that reports design issues each time a violation of the rule occur. The rule examines the database design based on the in-memory database model. Furthermore, rules can access the data in the database. Rules are independent of each other, but can specify dependencies to providers.
Provider Definition: A provider is similar to a rule, but instead of reporting design issues, a provider exposes additional information about the schema or the data. Providers can be used when multiple rules need the same information about a schema. For instance, if two rules need information about table importance, a provider can run the analysis and expose the information to both rules.
New rules and providers can be added easily due to the plug-in architecture, even by third-party developers. This is especially useful for companies that have their own constraints, which need to be enforced by specialized rules.
The main task of the Controller is to schedule rules and providers for execution. Rules are scheduled to be executed in parallel if possible. The Controller resolves dependencies between rules and providers, and ensures the right order of execution. The Issue Collector collects and manages issues identified by the rules.
The UI Layer contains the user interface of DBLint. This user interface is used to: configure the database connection, select schemas and tables to be analyzed, select rules to be executed, show status of analysis during an execution, and finally generation of the report.
To make DBLint DBMS independent, the extraction layer specifies abstractions for creating database connection and extracting metadata and data. The extraction layer is designed as an instance of the abstract factory design pattern, using several abstractions from ADO.NET. The functionality required to support each DBMS is specified in an interface. This interface makes it easy to support a new DBMS, as the only requirement is to implement this interface and extending the factory to use the new implementation. The two prerequisites for supporting a DBMS are that (1) it has an ADO.NET data provider implementation and (2) the DBMS provides a mechanism for extracting metadata.
A generic model of the database is used to make DBLint DBMS independent, such that rules can analyze the database design regardless of which DBMS the metadata comes from.
The in-memory database model is an object representation of the database design that provides a uniform representation of database objects such as tables, columns, and referential constraints. The model support most common database concepts, however, the current implementation does not support UDT, check constraints or views.
The database model is a lightweight representation of a database; each table uses on average 20 kB of memory, including information about database objects such as columns, indices, and keys.
The SQL standard defines the information schema, which is a number of system tables containing relevant metadata about the database's definitions. Given the purpose of these standardized system tables, it should be easy to extract most of the data needed. However, most DBMSs have insufficient implementations of the standard, and extracting the full information requires querying DBMS specific system tables. In addition, the standard does not specify all the information needed to construct the model, e.g., index information.
Besides the differences in the structure of the system tables, there are also many differences in how metadata is represented. Boolean values, data-type definitions, referential constraints, update and delete rules, and so on, are often expressed differently. DBLint handles these differences in the extraction layer, where there is an implementation of the extraction interface for each DBMS. This makes the layers above the extraction layer DBMS independent.
Analyzing the data in a database gives a better insight into the usage of the database, than only analyzing metadata. However, data analysis poses the following challenges.
- Handling large data sets.
- Keeping a simple data interface.
- Limiting the number of data scans.
- Maintaining DBMS independence.
Challenge 1 is to consider the performance aspect when implementing data rules. This is in contrast to metadata rules that have a very small overhead. Furthermore, if a rule stores a local copy of each row from a database, then DBLint is effectively limited by main memory, therefore this should be avoided.
Challenge 2 is to maintain the plug-in type of rule definitions. Every complication of this interface, e.g., due to resource cleanup or thread synchronization, is a deviation from the principle that DBLint should be easy to extend.
Challenge 3 is to reduce the number of data scans, such that each row is only extracted once from the database. This is necessary to reduce the load on the database server as well as the network traffic. Furthermore, extracting each row only once may prove to be an optimization, if network bandwidth is a bottleneck.
Challenge 4 is to maintain DBMS independence, such that rules use the same interface regardless of DBMS. This is difficult due to different handling of data types depending on DBMS. Therefore, DBLint must translate rows from different DBMSs into a uniform format.
When a metadata rule is executed, it analyzes the complete database before returning. A data rule cannot examine the complete database in the same manner due to the possibly large data sets and multiple scans. Instead data rules only examine one table at a time. A data rule is therefore executed multiple times, once for each table in the database. This way, the Controller in the Rule Layer fully controls which tables are to be examined when.
We hypothesize that it is faster to limit examinations to one table at a time, than it is to analyze multiple tables at a time. The intuition behind this hypothesis, is that it allows the DBMS to fill its cache memory with data from that table alone, thus reducing reads from persistent storage.
Rules access data, using the construction shown in the code example below. Besides being relatively simple, this construction has the following advantages: it disposes the resources required to extract data, e.g. closing connections; it uses the well-known language constructs using and foreach; and finally it allows testing of different hypothesis about data extraction.
using (var rows = table.GetTableRows())
foreach (var row in rows)
{
if (row["user_name"] != ... //Snip
}
The simplest implementation of the GetTableRows method, would be to return a data reader of a SELECT * FROM
statement. It has the advantage of simplicity, but the disadvantage of all rules fetching all data from all tables.
In DBLint, rules are synchronized behind the scenes, such that only one data reader is created for each table, instead of having each rule creating a new data reader. The rows from this reader are distributed to all the rules enumerating the rows in the table. When a rule has processed enough rows to determine if there is an issue, it can cancel the enumeration by breaking out of the foreach loop. The using declarative then notifies the Row Fetcher, which then removes the rule from the set of rules enumerating the table.
Each rule can specify dependencies to providers, and each provider can specify dependencies to other providers. A provider cannot expose the additional information about the schema to rules, before it has been executed. Therefore the order of execution is important and simply executing all rules and providers in parallel is inadequate. DBLint handles this problem by scheduling rules and providers such that all dependencies are fulfilled automatically.
Dependencies among rules and providers are represented as a dependency graph, in which rules and providers are represented as vertices, and a dependency from rule A to provider B is represented by an arrow from A to B. This is illustrated by the example in Figure 2. There are two rules in this example, the missing primary-key rule without dependencies and the inconsistent naming-convention rule with a dependency to the table-importance provider.
Figure 2: Dependency Graph
To get a valid execution order, the dependency graph is sorted topologically, using a post-order depth-first traversal, starting at the nodes without incoming edges. A valid execution order is an order that satisfies the dependencies specified by rules and providers. Figure 3 shows one of the possible solutions when applying the algorithm to the example in Figure 2.
Figure 3: Dependency Graph Sorted
The execution order is parallelized such that rules and providers without interdependence are executed concurrently. For instance, in Figure 3 the first three nodes are executed in parallel, followed by node four and five.
Severities and thresholds in the default configuration of the rules are determined through an empirical evaluation of many schemas. However, all settings in DBLint are configurable as a mean to separate mechanisms from policies. If these thresholds and severities were hardcoded into the DBLint source code, it would be an argument against the usefulness of DBLint.