Grammar Crawler is a configurable SQL fuzzer that works by crawling an ANTLR4 grammar to generate a wide variety of statements with valid syntax. It is database-agnostic, but ships with a copy of the MySQL 8 ANTLR grammar from Oracle's MySQL Workbench project to facilitate testing with MySQL's grammar.
Grammar Crawler was built to help expand the testing for Dolt DB and to help measure compliance with MySQL's syntax. Generated statements from Grammar Crawler are fed into a fork of SqlLogicTest to validate them against MySQL, and then all successful statements are added to the set of test statements that run against Dolt DB as part of nightly test runs. Grammar crawler is still in early development, but has already helped identify gaps in Dolt DB's MySQL syntax compliance.
SQL statement fuzzing is a well-established technique and there are many existing SQL fuzzers that are popular and work well. At DoltHub, we already use multiple tools with fuzzing features as part of our testing ( i.e. SqlLogicTest, DoltHub/fuzzer); Grammar Crawler takes a different approach and compliments those existing tools. It exploits the fact that there is a fully defined grammar for the syntax we want to support, and allows us to thoroughly explore that space.
Before we cover any more details, let's see some quick examples of Grammar Crawler in action...
This first example shows how to configure the crawler to perform a complete crawl on the dropStatement
rule in the
MySQL grammar, with just a few rules skipped, completed statements output to stdout, and summary statistics printed
out at the end.
Open up the DropStatementsGenerator
class in the repo, run the main
method, and you'll see results like this:
DROP UNDO TABLESPACE "text0";
DROP UNDO TABLESPACE "text1" ENGINE "text2" ENGINE "text3";
DROP UNDO TABLESPACE "text4" ENGINE "text5" ENGINE 'text6';
DROP UNDO TABLESPACE "text7" ENGINE "text8" ENGINE "text9";
DROP UNDO TABLESPACE "text10" ENGINE "text11" ENGINE `t5`;
...
DROP DATABASE IF EXISTS "text21636";
DROP DATABASE IF EXISTS `t3a33`;
DROP DATABASE IF EXISTS t3a34;
Total Statements: 14,900
Literal Element Coverage:
- Total: 47
- Used: 47
- Unused: 0
- Frequent: 31
- Coverage: 100.00%
This next example is a bit more involved. The syntax for create table
statements is much more complex than the syntax
for drop
statements, so this example configures more skipped rules to prune down the crawl space. This example also
uses a different CrawlStrategy
– instead of crawling every possible path (i.e. exploring the full syntax space), it
uses the CoverageAwareCrawlStrategy
to try and make good decisions about whether to crawl a path in the grammar or not
based on the current coverage of the grammar. It also demonstrates the configuration for crawler.setStatementLimit
,
which allows you to cap the maximum number of statements generated by the crawler. Finally, this example configures
multiple StatementWriters – one to write statements to stdout and a second to write SqlLogicTest protos to a file,
in a format ready to be consumed by SqlLogicTest.
Open up the CreateTableStatementsGenerator
class in the repo, run the main
method, and you'll see results similar to
this:
CREATE TABLE `t1` (LIKE `c1`);
CREATE TABLE `t2` (LIKE c1);
CREATE TABLE `t3` LIKE `c1`;
...
CREATE TABLE `t4c4b3e` (c1 TIMESTAMP (12) UNIQUE, c2 CHAR VARYING (4) KEY) PASSWORD 'text7968898' ENGINE 'text7968899';
CREATE TABLE `t4c4b3f` (c1 TIMESTAMP (11) UNIQUE, c2 CHAR VARYING (4) KEY) PASSWORD 'text7968900' ENGINE `c3`;
CREATE TABLE `t4c4b40` (c1 TIMESTAMP (14) UNIQUE, c2 CHAR VARYING (4) KEY) PASSWORD 'text7968901' ENGINE c3;
Total Statements: 1,000,000
Literal Element Coverage:
- Total: 123
- Used: 112
- Unused: 11
- Coverage: 91.06%
The crawler can be configured to skip parts of the grammar, to control how the crawler selects paths in the grammar graph to crawl, the maximum number of statements to generate, the output format for generated statements, and more.
The crawler supports three crawl strategies:
- Full Crawl – every path through the grammar graph will be explored, with some caveats (e.g. cycles are detected and skipped). This mode works well for small grammars or small subsets of a grammar, but can quickly produce a LOT of generated statement templates.
- Random Crawl – this naive random crawl strategy will randomly select whether or not to continue crawling a path in the grammar graph. This is useful when you have a large solution space and only want a sample of generated statements.
- Coverage-Aware Crawl – this crawl strategy uses information from the crawler on which literal elements have been used in completed statements and attempts to maximize coverage of the grammar by preferring paths through the grammar that will include unused literal elements.
After the crawler finishes a complete path through the grammar graph and has generated the template for a valid statement, it reifies the template into a valid SQL statement. This involves plugging in literal values for placeholders in the template and doing any minor cleanup to the statement to help increase the chances of it executing cleanly. This is a critical step for producing semantically valid statements instead of just syntactically valid statements. For example, the reification logic should ideally understand that when plugging in a literal value for a "DEFAULT " clause in the column specification of a table, the selected value needs to be compatible with the type of the column being defined. Reification logic in grammar crawler is still simplistic and needs further refinement in order to increase the crawler's ability to generate semantically correct statements.
After a statement has been reified, it is sent to one or more StatementWriter
implementations configured for the
Crawler. There are currently two implementations of StatementWriter
available:
StdOutStatementWriter
– This implementation simply outputs the statement directly to StdOut.SQLLogicProtoStatementWriter
– This implementation writes out a proto format suitable for SqlLogicTest to process.
We're happy to accept contributions if you want to use the grammar crawler in your work and need additional behavior. Although we are focused exclusively on MySQL grammar compliance, there is nothing inherent in Grammar Crawler's design that would prevent it from working with other grammars.
Feel free to cut issues or Pull Requests or come join the Dolt DB Discord server and chat with us about ideas.