Skip to content

dolthub/grammar-crawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

85 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Grammar Crawler

Grammar Crawler is a configurable SQL fuzzer that works by crawling an ANTLR4 grammar to generate a wide variety of statements with valid syntax. It is database-agnostic, but ships with a copy of the MySQL 8 ANTLR grammar from Oracle's MySQL Workbench project to facilitate testing with MySQL's grammar.

Grammar Crawler was built to help expand the testing for Dolt DB and to help measure compliance with MySQL's syntax. Generated statements from Grammar Crawler are fed into a fork of SqlLogicTest to validate them against MySQL, and then all successful statements are added to the set of test statements that run against Dolt DB as part of nightly test runs. Grammar crawler is still in early development, but has already helped identify gaps in Dolt DB's MySQL syntax compliance.

Why build another SQL fuzzer?

SQL statement fuzzing is a well-established technique and there are many existing SQL fuzzers that are popular and work well. At DoltHub, we already use multiple tools with fuzzing features as part of our testing ( i.e. SqlLogicTest, DoltHub/fuzzer); Grammar Crawler takes a different approach and compliments those existing tools. It exploits the fact that there is a fully defined grammar for the syntax we want to support, and allows us to thoroughly explore that space.

Get Crawling 🐛

Before we cover any more details, let's see some quick examples of Grammar Crawler in action...

Example: Generate All Drop Statements

This first example shows how to configure the crawler to perform a complete crawl on the dropStatement rule in the MySQL grammar, with just a few rules skipped, completed statements output to stdout, and summary statistics printed out at the end.

Open up the DropStatementsGenerator class in the repo, run the main method, and you'll see results like this:

DROP UNDO TABLESPACE "text0";
DROP UNDO TABLESPACE "text1" ENGINE "text2" ENGINE "text3";
DROP UNDO TABLESPACE "text4" ENGINE "text5" ENGINE 'text6';
DROP UNDO TABLESPACE "text7" ENGINE "text8" ENGINE "text9";
DROP UNDO TABLESPACE "text10" ENGINE "text11" ENGINE `t5`;
...
DROP DATABASE IF EXISTS "text21636";
DROP DATABASE IF EXISTS `t3a33`;
DROP DATABASE IF EXISTS t3a34;

Total Statements: 14,900
Literal Element Coverage: 
 - Total:    47
 - Used:     47
 - Unused:   0
 - Frequent: 31
 - Coverage: 100.00%

Example: Generate Create Table Statements

This next example is a bit more involved. The syntax for create table statements is much more complex than the syntax for drop statements, so this example configures more skipped rules to prune down the crawl space. This example also uses a different CrawlStrategy – instead of crawling every possible path (i.e. exploring the full syntax space), it uses the CoverageAwareCrawlStrategy to try and make good decisions about whether to crawl a path in the grammar or not based on the current coverage of the grammar. It also demonstrates the configuration for crawler.setStatementLimit, which allows you to cap the maximum number of statements generated by the crawler. Finally, this example configures multiple StatementWriters – one to write statements to stdout and a second to write SqlLogicTest protos to a file, in a format ready to be consumed by SqlLogicTest.

Open up the CreateTableStatementsGenerator class in the repo, run the main method, and you'll see results similar to this:

CREATE TABLE `t1` (LIKE `c1`);
CREATE TABLE `t2` (LIKE c1);
CREATE TABLE `t3` LIKE `c1`;
...
CREATE TABLE `t4c4b3e` (c1 TIMESTAMP (12) UNIQUE, c2 CHAR VARYING (4) KEY) PASSWORD 'text7968898' ENGINE 'text7968899';
CREATE TABLE `t4c4b3f` (c1 TIMESTAMP (11) UNIQUE, c2 CHAR VARYING (4) KEY) PASSWORD 'text7968900' ENGINE `c3`;
CREATE TABLE `t4c4b40` (c1 TIMESTAMP (14) UNIQUE, c2 CHAR VARYING (4) KEY) PASSWORD 'text7968901' ENGINE c3;

Total Statements: 1,000,000
Literal Element Coverage: 
 - Total:    123
 - Used:     112
 - Unused:   11
 - Coverage: 91.06%

Crawler Components

Crawler Configuration

The crawler can be configured to skip parts of the grammar, to control how the crawler selects paths in the grammar graph to crawl, the maximum number of statements to generate, the output format for generated statements, and more.

Crawl Strategies

The crawler supports three crawl strategies:

  • Full Crawl – every path through the grammar graph will be explored, with some caveats (e.g. cycles are detected and skipped). This mode works well for small grammars or small subsets of a grammar, but can quickly produce a LOT of generated statement templates.
  • Random Crawl – this naive random crawl strategy will randomly select whether or not to continue crawling a path in the grammar graph. This is useful when you have a large solution space and only want a sample of generated statements.
  • Coverage-Aware Crawl – this crawl strategy uses information from the crawler on which literal elements have been used in completed statements and attempts to maximize coverage of the grammar by preferring paths through the grammar that will include unused literal elements.

Reification

After the crawler finishes a complete path through the grammar graph and has generated the template for a valid statement, it reifies the template into a valid SQL statement. This involves plugging in literal values for placeholders in the template and doing any minor cleanup to the statement to help increase the chances of it executing cleanly. This is a critical step for producing semantically valid statements instead of just syntactically valid statements. For example, the reification logic should ideally understand that when plugging in a literal value for a "DEFAULT " clause in the column specification of a table, the selected value needs to be compatible with the type of the column being defined. Reification logic in grammar crawler is still simplistic and needs further refinement in order to increase the crawler's ability to generate semantically correct statements.

Statement Output

After a statement has been reified, it is sent to one or more StatementWriter implementations configured for the Crawler. There are currently two implementations of StatementWriter available:

  • StdOutStatementWriter – This implementation simply outputs the statement directly to StdOut.
  • SQLLogicProtoStatementWriter – This implementation writes out a proto format suitable for SqlLogicTest to process.

Contributions

We're happy to accept contributions if you want to use the grammar crawler in your work and need additional behavior. Although we are focused exclusively on MySQL grammar compliance, there is nothing inherent in Grammar Crawler's design that would prevent it from working with other grammars.

Feel free to cut issues or Pull Requests or come join the Dolt DB Discord server and chat with us about ideas.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published