CASSANDRA-17639
patch by Josh McKenzie, Diogenese Topper; reviewed by -- for CASSANDRA-17639

Co-authored by: Josh McKenzie
Co-authored by: Diogenese Topper <diogenese@constantia.io>
dtopdontstop authored and ErickRamirezAU committed May 19, 2022
1 parent f4d3119 commit b5bfe9433488ad696337f304b5008599fc1c29fe
Showing 3 changed files with 158 additions and 0 deletions.
@@ -8,6 +8,31 @@ NOTES FOR CONTENT CREATORS
- Replace post tile, date, description and link to you post.
////

//start card
[openblock,card shadow relative test]
----
[openblock,card-header]
------
[discrete]
=== The Path to Green CI
[discrete]
==== May 19, 2022
------
[openblock,card-content]
------
We reflect on our development journey and the work that goes into testing, and pull out some numbers to demonstrate the level of testing that now goes into Apache Cassandra as we approach the GA of Cassandra 4.1.

[openblock,card-btn card-btn--blog]
--------
[.btn.btn--alt]
xref:blog/The-Path-to-Green-CI.adoc[Read More]
--------

------
----
//end card

//start card
[openblock,card shadow relative test]
----
@@ -0,0 +1,133 @@
= The Path to Green CI
:page-layout: single-post
:page-role: blog-post
:page-post-date: May, 19 2022
:page-post-author: Josh McKenzie
:description: Testing Apache Cassandra
:keywords:

:!figure-caption:

.Image credit: https://unsplash.com/@hasanalmasi[Hasan Almasi on Unsplash^]
image::blog/the-path-to-green-ci-unsplash-hasan-almasi.jpg[the path to Green CI]

As we approach the Cassandra 4.1 GA release, it’s a great time to stop and reflect on some of our development history, the past year since we released 4.0, and where we’re headed in the future. In this blog post, we’re going to limit ourselves to testing Cassandra, as that’s been a huge focus in the 4.0+ time frame.

=== The Numbers

But don’t just take my word for it - let’s start with some numbers! We’re going to use “lines of code” as a loose proxy for “where we’re spending our time”. This is definitely a fraught metric, but as _one_ way of viewing things it paints a pretty interesting picture, consistent with many of our intuitions.

The tables below show, for each of our past three major releases, how many lines of raw code exist in each of the following:

1. src/java (`src` in the tables): raw database code
2. test/unit (`junit`): unit testing code
3. test/distributed (`jdtest`): distributed tests in Java code
4. The cassandra-dtest repo .py test files (`pdtest`): Python distributed tests using https://github.com/riptano/ccm[ccm^]

[cols=3*]
|=======
|||*% total*

|*3.0 Total:*|*316,262*|100.00%

|src|181,871|57.51%

|junit|100,284|31.71%

|jdtest|9,225|2.92%

|pdtest|24,882|7.87%

|All tests|134,391|42.49%
|=======

[cols=4*]
|=====
|||*% total*|*% of 3.0 size*

|*4.0 Total:*|495,566|100.00%|156.69%

|src|261,825|52.83%|143.96%

|junit|172,672|34.84%|172.18%

|jdtest|21,769|4.39%|235.98%

|pdtest|39,300|7.93%|157.95%

|All tests|233,741|47.17%|
|=====

[cols=4*]
|=======
|||*% total*|*% of 4.0 size*

|*4.1 Total:*|566,127|100.00%|114.24%

|src|297,685|52.58%|113.70%

|junit|197,231|34.84%|114.22%

|jdtest|31,306|5.53%|143.81%

|pdtest|39,905|7.05%|101.54%

|All tests|268,442|47.42%|
|=======

The biggest thing that immediately jumps out to me: our in-jvm dtests have been growing at a very strong pace relative to the rest of the codebase. Immense effort has gone into not just authoring this testing _framework_, but also adding new tests to it. As a percentage of our total code, test code made a significant jump from 3.0 to 4.0 of almost five percentage points (42.49% to 47.17%).

We can also infer that our new code addition to the database has _accelerated_ in the past year, as the delta from 4.0 to 4.1 represents a time frame of one calendar year and roughly 36k LoC net add, vs. the 5.5-year gap between 3.0 and 4.0 with a net add of ~80k LoC (roughly 14.5k per year). The database also had a release line of 3.1-3.11 introducing new features and tests during this time window, but for the sake of this analysis we’re only considering traditional major releases.
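To make the arithmetic behind these tables easy to re-check, here is a minimal Python sketch. The line counts are copied from the tables above; the function names (`total`, `share`, `ratio`, `share_of_tests`, `net_add`) are our own illustrative choices and not part of any Cassandra tooling, and the sketch says nothing about how the original counts were gathered.

```python
# Line counts copied from the tables above (release -> component -> lines).
LOC = {
    "3.0": {"src": 181_871, "junit": 100_284, "jdtest": 9_225, "pdtest": 24_882},
    "4.0": {"src": 261_825, "junit": 172_672, "jdtest": 21_769, "pdtest": 39_300},
    "4.1": {"src": 297_685, "junit": 197_231, "jdtest": 31_306, "pdtest": 39_905},
}


def total(release):
    """Sum of all components for a release."""
    return sum(LOC[release].values())


def share(release, component):
    """Component size as a percentage of the release total."""
    return 100 * LOC[release][component] / total(release)


def ratio(new, old, component):
    """Size in `new` as a percentage of the size in `old` (100% = unchanged)."""
    return 100 * LOC[new][component] / LOC[old][component]


def share_of_tests(release):
    """All test code (everything except src) as a percentage of the total."""
    tests = total(release) - LOC[release]["src"]
    return 100 * tests / total(release)


def net_add(new, old, component):
    """Lines added to a component between two releases."""
    return LOC[new][component] - LOC[old][component]


for release in LOC:
    print(f"{release}: {total(release):,} lines, "
          f"{share_of_tests(release):.2f}% test code")
print(f"jdtest 4.0 vs 3.0: {ratio('4.0', '3.0', 'jdtest'):.2f}%")
print(f"src added 3.0 -> 4.0: {net_add('4.0', '3.0', 'src'):,}")
```

Note that the last column of the 4.0 and 4.1 tables corresponds to `ratio`, i.e. the new size expressed as a percentage of the old size, so 100% would mean no growth.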

So what does all this mean for us working on and depending on the project? It means that with the release of 4.1, _we have roughly 8% more code just *unit* testing the database than we had in the entire database in 3.0_ (197,231 vs. 181,871 lines). It means that new development on Cassandra is accelerating. It also means we’re constantly moving the goalposts on what’s required to keep https://ci-cassandra.apache.org/[Green (passing) CI (continuous integration)^].

=== Keeping it Green

We all know Cassandra is an incredibly complex piece of software. The power to scale up linearly to petabytes of data on hundreds of machines, with zero downtime, in a masterless single logical cluster, where machines can drop out and in, https://cassandra.apache.org/doc/latest/cassandra/operating/hints.html[hint], https://cassandra.apache.org/doc/latest/cassandra/operating/read_repair.html[heal], and https://cassandra.apache.org/doc/latest/cassandra/operating/repair.html[repair] simply cannot be implemented without a significant amount of code, infrastructure, and testing to ensure it works as expected. Further, as we support hot upgrades with zero downtime between versions with mixed version clusters running, we have a strong commitment to backwards compatibility in the mix.

One of our struggles over time has been the software and hardware complexity required to keep our testing infrastructure “clean”, or green, on an ongoing basis. Balancing runtime, resourcing, and cost with a system as complex as Cassandra is a challenge to begin with, and one that is only growing over time, as we’ve seen above.

We always drive down to a stable zero test failures at a GA release; however, what we refer to as “flaky tests” sneak back into our suite over time. Let’s take a recent example, https://nightlies.apache.org/cassandra/ci-cassandra.apache.org/job/Cassandra-trunk/1112/[build run 1112^] on trunk (effectively Cassandra 4.1 pre-alpha).

19 test failures! https://nightlies.apache.org/cassandra/ci-cassandra.apache.org/job/Cassandra-trunk/1112/testReport/[Out of an entire suite of 49,704 tests, that makes a 99.96% pass rate^]. Nobody wants a database that works 99.96% of the time, however, and that’s assuming we have 100% test coverage of not just all our code but also all possible combinations of state, a problem so daunting that some contributors are https://issues.apache.org/jira/browse/CASSANDRA-15348[pushing the bleeding edge of the state of the art of distributed database testing^].
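As a quick sanity check on that pass rate, and on why a handful of flaky tests still hurts, here is a small Python sketch. `pass_rate` simply reproduces the arithmetic above. `green_run_probability` rests on an assumption of ours: that each flaky test fails independently with the same probability per run. The 5% figure used below is purely hypothetical and not a measured Cassandra number.

```python
def pass_rate(failures, total_tests):
    """Percentage of tests that passed in a single run."""
    return 100 * (total_tests - failures) / total_tests


def green_run_probability(flaky_tests, failure_probability):
    """Chance that a whole run is green, ASSUMING every flaky test fails
    independently with the same per-run probability."""
    return (1 - failure_probability) ** flaky_tests


# Figures from build run 1112: 19 failures out of 49,704 tests.
print(f"pass rate: {pass_rate(19, 49_704):.2f}%")

# Hypothetical: 19 flaky tests that each fail 5% of the time.
print(f"chance of a fully green run: {green_run_probability(19, 0.05):.0%}")
```

Under those assumptions, well under half of all runs would come back fully green, even though any individual test passes the vast majority of the time.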

Burning down fewer than 20 flaky tests in the eight-week window between freeze and our target release date is quite doable, so why not just continue to float along with a low number of test failures? Well, it gets more complicated when we look at _where_ we run our tests.

=== Circle vs. Jenkins

As https://cassandra.apache.org/_/development/testing.html[we outline in our contributor guide on testing], tests can be run either on https://ci-cassandra.apache.org/[Apache Jenkins infrastructure^] or on https://github.com/apache/cassandra/tree/cassandra-4.1/.circleci[CircleCI^]. The primary differences between these two systems are runtime, cost, and resources allocated to each individual test. While some contributors have access to paid CircleCI accounts that allow them to dedicate more resources to their test runs and shorten feedback loops, this is an open source volunteer project and our canonical CI is the Apache Jenkins infrastructure.

One challenge this introduces is tests that “flake” due to resource allocation differences. For instance, if you run a particularly intensive unit test in a container with eight cores and 16 GB of RAM, you can expect a different runtime than in a container with two cores and 8 GB of RAM. Throw into the mix that all of us are doing development on different laptops, with different core counts, with different _architectures_, and you have a recipe for some pretty challenging non-deterministic test runtimes.

https://cwiki.apache.org/confluence/x/1AorCQ[Currently we accept either a Green run on CircleCI or a Green run on DevBranch on Jenkins as sufficient for committers to merge code^]. This introduces a gap for us, as CircleCI can allocate more resources to containers for running tests based on the plan you use, meaning a test could pass on CircleCI that subsequently fails on ASF Jenkins due to resourcing limitations.

Another challenge we face is that it can be difficult to author tests in Cassandra that have deterministic results in the face of scheduling pressures. Given the long legacy of our project (which dates back to 2008), we have quite a bit of static state without existing stub implementations for testing, meaning many of our unit tests spin up state in other areas of the database, write files to disk, and otherwise mutate state in adjacent subsystems. https://github.com/apache/cassandra/blob/cassandra-4.1/.circleci/generate.sh#L41-L49[What this translates into is the need to run new tests 100 times on CI infrastructure before committing them, a practice we haven’t yet enshrined into our process^].
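To give a feel for what repeating a test 100 times buys us, here is a minimal Python sketch. It is not the project's tooling: `count_failures`, `runs_needed`, and `flaky_test` are hypothetical names of ours, the 3% failure rate is made up, and `runs_needed` assumes that runs are independent of one another, which real CI runs on shared hardware may not be.

```python
import math
import random


def count_failures(test, runs=100):
    """Run `test` repeatedly and return how many runs raised AssertionError."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except AssertionError:
            failures += 1
    return failures


def runs_needed(failure_rate, confidence=0.95):
    """Smallest n such that 1 - (1 - p)^n >= confidence, i.e. the number of
    runs needed to see at least one failure with the given confidence,
    ASSUMING independent runs."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


# Hypothetical flaky test that fails about 3% of the time.
# The generator is seeded so that repeated executions behave the same way.
rng = random.Random(17639)


def flaky_test():
    assert rng.random() >= 0.03, "simulated timing-dependent failure"


print("failures in 100 runs:", count_failures(flaky_test, runs=100))
print("runs needed to catch a 3% flake with 95% confidence:", runs_needed(0.03))
```

Under the independence assumption, about 100 runs is roughly what it takes to catch a test that fails 3% of the time with 95% confidence; rarer flakes need proportionally more runs.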

=== Testing Complex Systems is Itself Complex

The Cassandra testing ecosystem consists of a variety of different suites targeting different subsystems and operations in the database. From a high level, a look at the https://ci-cassandra.apache.org/job/Cassandra-trunk/[top level testing pipelines of the project^] shows standouts like testing with https://issues.apache.org/jira/browse/CASSANDRA-6809[compression^], with https://issues.apache.org/jira/browse/CASSANDRA-8844[change-data-capture^] enabled, during upgrades, and at both the unit and distributed level. We have a https://ci-cassandra.apache.org/computer/[cluster of machines dedicated^] to testing Cassandra, all of which are https://github.com/apache/cassandra-builds/blob/trunk/ASF-jenkins-agents.md#current-agents[donated by different participants^] within the Apache Cassandra ecosystem, and tending to their needs is a significant task.

Taking a quick look at the https://ci-cassandra.apache.org/job/Cassandra-trunk/1112/flowGraphTable/[runtime pipeline under the hood^], you can see how large a distributed effort it is to break down the different jobs across these different agents. https://github.com/apache/cassandra-builds/blob/trunk/jenkins-dsl/cassandra_job_dsl_seed.groovy[The code required to generate, distribute, build, collect logs from, tear down, and maintain^] all these jobs on these machines lives in the https://github.com/apache/cassandra-builds[cassandra-builds repo^] under the Apache organization on GitHub.

Throwing all this hardware and parallelization at our almost 50,000 tests takes our total test runtime *down to 4h 9m 4s*. A big shout-out to Mick Semb Wever, committer and PMC member on the project, who’s done a ton of work to get us this far with our CI infrastructure!

We have a few ideas for ways to reduce the total processing burden of our tests; with this much compute required and this many tests, small percentages add up to big gains. Jacek Lewandowski is targeting some file operations and general speedup in https://issues.apache.org/jira/browse/CASSANDRA-17427[CASSANDRA-17427^], Berenguer Blasi is looking into potentially re-using dtest clusters in our Python dtests to cut out unnecessary cluster startup and shutdown times in https://issues.apache.org/jira/browse/CASSANDRA-16951[CASSANDRA-16951^], and after a little analysis I’ve uncovered that roughly 20% of our unit test runtime comes from just 2.62% of our tests, giving us some low-hanging fruit to potentially target to speed things up in https://issues.apache.org/jira/browse/CASSANDRA-17371[CASSANDRA-17371^].

Lastly, we have a Jenkins to JIRA integration script drafted in https://issues.apache.org/jira/browse/CASSANDRA-17277?focusedCommentId=17493385&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17493385[CASSANDRA-17277^] that would automatically update tickets with the results of their CI runs on ASF Jenkins infrastructure. This is necessary as we have two paths for code to get certified for inclusion (CircleCI or ASF Jenkins), with the former being more heavily resourced than the latter but the latter being our gatekeeper.
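Returning to the observation above that a small share of tests accounts for a large share of unit test runtime, here is one way such a list could be pulled out of per-test durations. This is a sketch only: the function name `slowest_covering`, the test names, and the durations are all hypothetical, and it is not the analysis behind CASSANDRA-17371.

```python
def slowest_covering(durations, runtime_share=0.20):
    """Return the slowest tests, slowest first, that together account for at
    least `runtime_share` of the total runtime."""
    total_runtime = sum(durations.values())
    picked, accumulated = [], 0.0
    by_duration = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    for name, seconds in by_duration:
        if accumulated >= runtime_share * total_runtime:
            break
        picked.append(name)
        accumulated += seconds
    return picked


# Hypothetical durations in seconds: two slow tests and 98 fast ones.
durations = {f"FastTest{i}": 10.0 for i in range(98)}
durations["SlowCompactionTest"] = 300.0
durations["SlowRepairTest"] = 200.0

slow = slowest_covering(durations)
print(f"{len(slow)} of {len(durations)} tests "
      f"({100 * len(slow) / len(durations):.2f}%) cover 20% of runtime: {slow}")
```

A list like this is only a starting point: a slow test may be slow for good reasons, so each candidate still needs a human look before it is split up or sped up.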

=== The Future of Testing in Cassandra

As we head into the verification cycle for Cassandra 4.1 we’re going to be using https://cwiki.apache.org/confluence/x/tQzjBw[the same Release Lifecycle definitions^] we ratified back in 2019. Of note, we won’t transition from alpha to beta without green tests: _“No flaky tests - All tests (Unit Tests and DTests) should pass consistently. A failing test, upon analyzing the root cause of failure, may be “ignored in exceptional cases”, if appropriate, for the release, after discussion in the dev mailing list.”_

So we’re going to drive back to a green test board as we do for each major release, but are we going to make an effort to stay there and if so, how?

I’ve been working on this project since early 2014 (!), and this has always been a challenge for us. That said, after analyzing the numbers for this blog post and realizing _just how much_ we’re proportionally expanding our _testing_, I’m heartened by the progress we’re making; the proportion of flaky or failing tests is objectively falling over time. A total of 15 failing tests out of 50,000 is a lot less than 15 failing out of 25,000, or 12,500 for example, so we’re definitely moving in the right direction.

If we take the value of having a green test board as self-evident (developer time, triaging, branch stability, feedback loops, etc.), how can we stay there after the 4.1 release? A bot letting us know ASAP if our patch correlates with a new test failure should help, as will lowering the total runtime required between running our tests and merging our patches. Lastly, in January of 2022 we introduced a new https://cwiki.apache.org/confluence/x/DI3kCw[Build Lead^] role to shepherd integration with our CI tracking system https://butler.cassandra.apache.org/#/[Butler^], which has had a very positive impact on our visibility of and momentum on fixing test failures.

We’re balancing a tension between wanting to get code changes into the system rapidly for contributors fortunate enough to be able to use CircleCI and providing for and encouraging usage of the freely available Apache Jenkins infrastructure, but we’re bridging the gap this naturally creates.

Contributors around the globe are working hard to get Cassandra 4.1 GA soon and just like Cassandra 4.0 before it, we expect this to be the most stable, best performing version of Apache Cassandra we’ve ever released. You can download the test build of Cassandra 4.1 https://nightlies.apache.org/cassandra/cassandra-4.1/Cassandra-4.1-artifacts/23/Cassandra-4.1-artifacts/[here^] and test it out - let us know what you think!

If you haven’t yet, come join the xref:community.adoc[Cassandra development community] and get involved in making the most scalable and available database in the world!
