Add history builder and an ability to write unit-tests with Harry
Implement a full repair test
ifesdjeen committed Oct 1, 2021
1 parent 83dd5a6 commit f6b4df664b5ec79cf555aa0fb34e26f40fd9e9cd
Showing 54 changed files with 2,330 additions and 727 deletions.
# Harry, a fuzz testing tool for Apache Cassandra

The project aims to generate _reproducible_ workloads that are as close to real-life
as possible, while being able to _efficiently_ verify the cluster state against
the model without pausing the workload itself.

# Introduction

Harry has two primary modes of functionality:

* Unit test mode: in which you define specific sequences of
operations and let Harry test these operations using different
schemas and conditions.
* Exploratory/fuzz mode: in which you define distributions of events
rather than sequences themselves, and let Harry try out
different things.

Usually, in unit-test mode, we apply several write operations to the
cluster state, then run different read queries and validate their
results. To learn more about writing unit tests, refer to the "Writing
Unit Tests" section.

In exploratory mode, we continuously apply write operations to the
cluster and validate their state, allowing data size to grow and simulating
real-life behaviour. To learn more about implementing test cases using
fuzz mode, refer to the "Implementing Tests" section of this guide, but it's likely
you'll have to read the rest of this document to implement more
complex scenarios.

# Writing Unit Tests

No special knowledge is required to write unit tests with Harry.
Usually, unit tests are written by hardcoding the schema, executing
several modification statements one after another, and then manually
validating the results of a `SELECT` query. This might work for simple
scenarios, but there’s still a chance that the tested feature will not
work for some other schema or some combination of values.

To improve the situation, we can express the test in more abstract
terms and, instead of writing it using specific statements, we can
describe which statement _types_ are to be used:

```java
test(new SchemaGenerators.Builder("harry")
                         .partitionKeySpec(1, 5)
                         .clusteringKeySpec(1, 5)
                         .regularColumnSpec(1, 10)
                         .generator(),
     historyBuilder -> {
         // ... build the history of operations to execute and verify ...
     });
```

This spec can be used to generate clusters of different sizes,
configured with different schemas, executing the given sequence of
actions both in isolation and combined with other randomly generated
ones, with failure injection.

Best of all, this test will _not only_ ensure that such a sequence of
actions does not produce an exception, but will also ensure that the
cluster responds with correct results to _any_ allowed read query.

`HistoryBuilder` uses the configuration provided by the Harry `Run`, which
can be written either with `Configuration#ConfigurationBuilder`, like we
did above, or provided in a [yaml file](

To begin specifying operations for a new partition,
`HistoryBuilder#nextPartition` has to be called, which returns a
`PartitionBuilder`. For the commands within this partition, you have
a choice between:
* `PartitionBuilder#simultaneously`, which will execute listed
operations with the same timestamp
* `PartitionBuilder#sequentially`, which will execute listed operations
with monotonically increasing timestamps, giving each operation its
own timestamp

Similarly, you can choose between:
* `PartitionBuilder#randomOrder`, which will execute listed operations
in random order
* `PartitionBuilder#strictOrder`, which will execute listed operations
in the order specified by the user

The rest of operations are self-explanatory: `#insert`, `#update`,
`#delete`, `#columnDelete`, `#rangeDelete`, `#sliceDelete`,
`#partitionDelete`, and their plural counterparts.
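Put together, a unit test body might read roughly like the sketch below. This is a hypothetical illustration following the method names in this section; the exact chaining and terminal calls in the DSL (for example, the `finish()` call) may differ:

```java
historyBuilder.nextPartition()
              .sequentially()   // each operation gets its own, increasing timestamp
              .strictOrder()    // execute operations in exactly this order
              .insert()
              .update()
              .rangeDelete()
              .finish();
```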

After the history generated by `HistoryBuilder` is replayed using
`ReplayingVisitor`, you can use any model (`QuiescentChecker` by
default) to validate queries. Queries can be provided manually or
generated using `QueryGenerator` or `TypedQueryGenerator`.

# Basic Terminology

* Inflate / inflatable: the process of producing a value (for example, a string
or a blob) from a `long` descriptor that uniquely identifies the value.
See the [data generation]( section
of this guide for more details.
* Deflate / deflatable: the reverse process of recovering, during verification,
the descriptor a value was inflated from. See the [model](
section of this guide for more details.
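To make these terms concrete, here is a toy sketch of the inflate/deflate round-trip. This is not Harry's actual value generator; it only illustrates that an invertible inflation function lets verification recover the descriptor from a value:

```java
public class InflateDemo {
    // "Inflate": produce a printable value from a long descriptor.
    // Harry's real generators produce type-appropriate values (strings,
    // blobs, etc.); hex encoding stands in for that here.
    static String inflate(long descriptor) {
        return Long.toHexString(descriptor);
    }

    // "Deflate": recover the descriptor a value was inflated from,
    // which is what the model relies on during verification.
    static long deflate(String value) {
        return Long.parseUnsignedLong(value, 16);
    }

    public static void main(String[] args) {
        long descriptor = 0xCAFEBABEL;
        // Round-trip: deflating an inflated value yields the descriptor.
        assert deflate(inflate(descriptor)) == descriptor;
    }
}
```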

For definitions of logical timestamp, descriptor, and other entities used during
inflation and deflation, refer to [formal relationships](

# Features

Currently, Harry can exercise the following Cassandra functionality:

* Supported data types: `int8`, `int16`, `int32`, `int64`, `boolean`, `float`,
`double`, `ascii`, `uuid`, `timestamp`. Collections are only _inflatable_.
* Random schema generation, with an arbitrary number of partition and clustering
keys
* Schemas with arbitrary `CLUSTERING ORDER BY`
* Randomly generated `INSERT` and `UPDATE` queries with all columns or an
arbitrary subset of columns
* Randomly generated `DELETE` queries: for a single column, single row, or
a range of rows
* Inflating and validating entire partitions (with allowed in-flight queries)
* Inflating and validating random `SELECT` queries: single row, slices (with single
open end), and ranges (with both ends of clusterings specified)

Inflating partitions is done using [Reconciler](
Validating partitions and random queries can be done using [Quiescent Checker](
and [Exhaustive Checker](

## What's missing

Harry is by no means feature-complete. The main things that are missing are:

* Some types (such as collections) are not deflatable
* Some types are implemented but are not hooked up (`blob` and `text`) to DSL/generator
* Partition deletions are not implemented
* 2i queries are not implemented
* Compact storage is not implemented
* Static columns are not implemented
* Fault injection is not implemented
* Runner and scheduler are rather rudimentary and require significant rework and proper scheduling
* TTL is not supported
* Some SELECT queries are not supported: `LIMIT`, `IN`, `GROUP BY`, token range queries
* Pagination is not implemented

Some things, even though implemented, can be improved or optimized:

* RNG should be able to yield less than 64 bits of entropy per step
* State tracking should be done in a compact off-heap data structure
* Inflated partition state and the per-row operation log should be kept in a
compact off-heap data structure
* Exhaustive checker can be significantly optimized
* Harry shouldn't rely on java-driver for query generation
* Exhaustive checker should use more precise information from data tracker, not
just watermarks
* Decision-making about _when_ we visit partitions and/or rows should be improved

This list of improvements is incomplete and should only give the reader a rough
idea of the state of the project. The main goal for the initial release was to
make Harry useful; now we can make it fast and feature-complete!

# Goals

_Reproducibility_ is achieved by using the PCG family of random number
generators and generating schema, configuration, and every step of the workload
from the repeatable sequence of random numbers. Schema and configuration are
visited for a logical timestamp, how many operations there will be in a batch,
what kinds of operations there will be, and how often each kind of operation is
going to occur.
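The determinism this buys can be illustrated with a tiny sketch. As an assumption for the example, `java.util.SplittableRandom` stands in for the PCG generators Harry actually uses; the point is only that the same seed reproduces the same decisions, which is what makes a failing workload replayable:

```java
import java.util.Arrays;
import java.util.SplittableRandom;

public class ReproducibilityDemo {
    // Derive a few workload "decisions" purely from the seed.
    static long[] decisions(long seed, int n) {
        SplittableRandom rng = new SplittableRandom(seed);
        long[] out = new long[n];
        for (int i = 0; i < n; i++)
            out[i] = rng.nextLong();
        return out;
    }

    public static void main(String[] args) {
        // Re-running with the same seed reproduces the exact sequence.
        assert Arrays.equals(decisions(42L, 5), decisions(42L, 5));
    }
}
```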

# Implementing Tests

All Harry components are pluggable and can be redefined. However, the
default implementations will cover most use cases, so in this guide
we’ll focus on the components most often used to implement different
scenarios:

* System Under Test: defines how Harry can communicate with
Cassandra instances and issue common queries. Examples of a system
under test can be a CCM cluster, a “real” Cassandra cluster, or an
in-JVM dtest cluster.
* Visitor: defines behaviour that gets triggered at a specific
logical timestamp. One of the default implementations is
`MutatingVisitor`, which executes the write workload against
`SystemUnderTest`. Examples of a visitor, besides a mutating visitor,
could be a validator that uses the model to validate results of
different queries, a repair runner, or a fault injector.
* Model: validates results of read queries by comparing its own
internal representation against the results returned by system
under test. You can find three simplified implementations of
model in this document: Visible Rows Checker, Quiescent Checker,
and an Exhaustive Checker.
* Runner: defines how operations defined by visitors are
executed. Harry includes two default implementations: a sequential
and a concurrent runner. Sequential runner allows no overlap
between different visitors or logical timestamps. Concurrent
runner allows visitors for different timestamps to overlap.

System under test is the simplest component to implement: you only
need a way to execute Cassandra queries. At the moment of writing, all
custom behaviour, such as nodetool commands, failure injection, etc.,
is implemented using a SUT / visitor combo: the visitor knows about
the internals of the cluster it is dealing with.
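For illustration, a minimal adapter could look like the sketch below. The interface name and signature here are assumptions for this sketch, not Harry's actual `SystemUnderTest` API; the point is that executing queries is essentially the whole contract:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of a system-under-test adapter: the only
// capability it must provide is executing a query and returning rows.
interface SystemUnderTest {
    List<Object[]> execute(String cql, Object... bindings);
}

// A fake SUT that records statements instead of talking to a cluster;
// a real implementation would delegate to CCM, a live cluster, or an
// in-JVM dtest cluster.
class RecordingSut implements SystemUnderTest {
    final List<String> log = new ArrayList<>();

    @Override
    public List<Object[]> execute(String cql, Object... bindings) {
        log.add(cql);
        return List.of(); // no rows; a real SUT returns the result set
    }
}
```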

Generally, a visitor has to follow the rules specified by
`PdSelector` and `DescriptorSelector`: it can only issue mutations
against the partition that `PdSelector` has picked for the current
LTS; it can visit exactly `DescriptorSelector#numberOfModifications`
rows within this partition; operations have to have a type specified
by `#operationKind`; and clustering and value descriptors have to be
in accordance with `DescriptorSelector#cd` and
`DescriptorSelector#vds`. These limitations exist because the model
has to be able to reproduce the exact sequence of events that was
applied to the system under test.

The default implementations of partition and clustering descriptor
selectors, used in fuzz mode, make verification more efficient: for
example, they make it possible to find all logical timestamps that
visit the same partition as a given logical timestamp. When running
Harry in unit-test mode, we instead use a special generating visitor
that keeps the entire given sequence of events in memory rather than
producing it on the fly. For efficiency reasons, we do not use
generating visitors for generating and verifying large datasets.

# Formal Relations Between Entities

To be able to implement efficient models, we had to reduce the amount of state
