[SPIKE] [WIP] Refactor Elasticsearch Exporter tests #9258

Closed · wants to merge 16 commits

Conversation

npepinpe (Member) commented Apr 29, 2022

Description

This spike explores an idea to refactor the Elasticsearch exporter tests. The main goals here were:

  • Cover as much as possible via unit tests alone. This gives us a faster feedback cycle when changing something: developers can focus first on fixing unit tests, which have a smaller blast radius and faster execution, before even looking at integration tests.
  • Narrow the scope of the integration tests as much as possible. This meant getting rid of the broker, workflow engine semantics, etc., and just testing the integration of the exporter instance with Elasticsearch.

In general, I wanted to keep the same coverage as before. I think the coverage is now higher, but that was not the original motivation.

To do this, I introduced controllable test implementations of the exporter API under a new module, exporter-tests. You can see in the first commit that I had some kind of test harness to simulate the exporter director, but I decided to drop it in the end; if we need it, we can introduce it later, and it doesn't seem all that necessary.

I added these as a new module, and not in the test-util module, because I would like to move away from catch-all modules like util and test-util. They not only complicate our dependency tree by being ubiquitous, but they're also shallow modules with no clear boundaries, high coupling to other modules, and low cohesion within themselves. However, this isn't a strict requirement from my POV, so I'm happy to have this challenged.

Note that in there I also started making heavier use of SpotBugs annotations. This is something I would like to try, but I'm still not convinced of the value, so I'm fine with dropping it (though I'd like to first see if we can get some value out of it).

I also made some changes to the ProtocolFactory, which generates random records. The main changes: always produce positive longs, and make every build reproducible. The first is because our records have many long-typed fields that are supposed to have timestamp semantics, where negative values are invalid. It would have been better if the types themselves carried those semantics (e.g. an Instant instead of a long), but as it is, I think it's an acceptable workaround. This came about because Elastic expects these fields to have those semantics and rejects records where a timestamp would be negative.

The second change also came from practical use, though I should've seen it before: it's much more useful if every run of a test produces the same record. We're not doing property-based tests here; we just want a fully filled-out record and a test we can easily re-run. So we always construct the factory with a fixed seed. If the factory is used in property-based tests, a different seed can be provided per run.
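
For illustration, here's how a test can rely on that reproducibility. This is a minimal sketch: the seed constructor and generateRecord method are taken from the description above, but the package names and exact signatures are approximations, not verified against the code.

```java
import io.camunda.zeebe.protocol.record.Record;
import io.camunda.zeebe.protocol.record.RecordValue;
import io.camunda.zeebe.protocol.record.ValueType;
import org.junit.jupiter.api.Test;

final class ProtocolFactoryExampleTest {
  @Test
  void shouldGenerateReproduciblePositiveRecord() {
    // fixed seed => every run generates the exact same record, so a failing
    // test is trivially reproducible
    final ProtocolFactory factory = new ProtocolFactory(1L);
    final Record<RecordValue> record = factory.generateRecord(ValueType.JOB);

    // long fields with timestamp semantics are always positive, so Elastic
    // will not reject the document when parsing dates
    assert record.getTimestamp() > 0;
  }
}
```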

The rest of the changes are then only in the Elasticsearch exporter module. There were some production changes, notably splitting the ElasticsearchClient class and renaming it to ElasticClient. We could revert the renaming, but since we now use the official ElasticsearchClient provided by Elastic for test verifications, there were some ugly name collisions. As I said, though, we can revert it; it's not that big a deal. The class was split into the following:

  • IndexRouter: a poorly named class whose responsibility is to compute a record's ID and its index. It can also compute just a part of these, e.g. only the record's index prefix. The goal is that it encapsulates the part of the exporter's API that defines index names and record IDs, both of which must be stable for consumers (see the sketch after this list).
  • TemplateReader: reads templates from the resources, optionally substituting some of their properties with configuration properties, e.g. the index prefix, the number of shards, etc.
  • RestClientFactory: does what the name says: it takes the ElasticsearchExporterConfiguration and produces a RestClient. This was also helpful for reusing the same factory in tests to create an official ElasticsearchClient on top of the same RestClient.
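
To make the IndexRouter responsibility concrete, here is a rough sketch of what its surface could look like. The method names and the exact ID/index formats are hypothetical; only the responsibility itself (stable IDs and index names) is taken from the description above.

```java
import io.camunda.zeebe.protocol.record.Record;
import java.time.LocalDate;

// Hypothetical sketch of the IndexRouter responsibility; real formats may differ.
final class IndexRouter {
  private final String prefix; // the configured index prefix

  IndexRouter(final String prefix) {
    this.prefix = prefix;
  }

  /** Record IDs must be stable for consumers: partition + position identify a record. */
  String idFor(final Record<?> record) {
    return record.getPartitionId() + "-" + record.getPosition();
  }

  /** Index names must be stable as well; assumed here to be prefix, value type, and date. */
  String indexFor(final Record<?> record, final LocalDate date) {
    return indexPrefixFor(record) + "_" + date;
  }

  /** Computes only the prefix part, as mentioned above. */
  String indexPrefixFor(final Record<?> record) {
    return prefix + "-" + record.getValueType().name().toLowerCase();
  }
}
```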

Splitting allows us to add more unit tests for each of these pieces, which was more complicated to do and verify with one big ElasticClient class. The ElasticClient class is now only responsible for buffering records in memory, flushing them as a bulk request, and sending the PUT /_index_template and PUT /_component_template/ requests, using the classes above. So essentially it mostly wires things together and communicates with Elastic. To be honest, it could be reduced further to only the Elastic communication, with the wiring moved to the exporter; that will be easier to do now.

Two new DTOs were added:

  1. BulkRequestAction: a bulk request in Elastic is a list of newline-delimited pairs of commands. Since we only index documents, each pair consists of an IndexAction and the document to index. The action specifies the ID, index name, and routing of the actual document. Using an explicit domain object instead of just a Map simplifies maintenance, readability, and testability (it's much easier to assert against a known structure); see the sketch after this list.
  2. Template: this represents a deserialized template. You can use the DTO when reading a component or index template from the resources via TemplateReader, or even when parsing the response of a GetIndexTemplate request. This again simplifies writing assertions in tests.
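
For illustration, here's a possible shape for the action DTO and how it maps onto Elastic's bulk format. The field names are assumptions; only the newline-delimited wire format itself is Elastic's documented bulk API.

```java
// Hypothetical shapes; the real DTOs may differ. They mirror Elastic's bulk
// "index" action: {"index": {"_index": "...", "_id": "...", "routing": "..."}}
record IndexAction(String index, String id, String routing) {}

record BulkRequestAction(IndexAction index) {}

// Each buffered record then becomes one action line plus one document line in
// the newline-delimited body that POST /_bulk expects, e.g.:
//   {"index":{"_index":"zeebe-record_job_2022-04-29","_id":"1-42","routing":"1"}}
//   {"partitionId":1,"position":42,...}
```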

Finally, getting to the tests: there is no class hierarchy anymore, and most of the utilities were removed. What we have now are two utilities:

  • TestClient: a simple wrapper around both the official high-level and low-level REST clients from Elastic. I initially used the clients directly, but this had two downsides: the ElasticsearchClient is not closeable (yet it does own resources, e.g. threads, sockets, etc.), and it has some quirks which are not ideal if you're not familiar with Elastic. At times I found using the low-level client simpler (for example, for GetIndexTemplate it is much easier to work with than the DTOs returned by the high-level client).
  • TestSupport: the configuration we use, especially the IndexConfiguration, is nice for user input but terrible for programmatic access. Since we have one fixed field per value type, writing parameterized tests is a pain. We could use reflection, but that would be somewhat brittle. So TestSupport has some methods to deal with these things, e.g. disabling indexing for value type X, or enabling indexing for record type Y. Similarly, it also provides a convenience method to create a default Elastic container with some memory constraints and security settings (see the sketch below).
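
The container convenience method might look roughly like this. This is a sketch using Testcontainers; the image version and the exact memory/security settings are illustrative assumptions, not the actual values.

```java
import org.testcontainers.elasticsearch.ElasticsearchContainer;
import org.testcontainers.utility.DockerImageName;

final class Containers {
  // A "default" Elastic container with bounded memory and security disabled
  // for tests; version and settings here are illustrative only.
  static ElasticsearchContainer createDefaultContainer() {
    return new ElasticsearchContainer(
            DockerImageName.parse("docker.elastic.co/elasticsearch/elasticsearch:7.17.0"))
        .withEnv("ES_JAVA_OPTS", "-Xms512m -Xmx512m") // cap the JVM heap
        .withEnv("xpack.security.enabled", "false"); // no auth needed in tests
  }
}
```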

After that, we have mostly unit tests (everything ending in Test) and integration tests (everything ending in IT). The unit tests are, well, unit tests; not much to say there. I did extend the coverage a little bit, but I think it's not fully there yet. I mostly focused on at least keeping the same coverage as before.

There are three integration tests. Generally, I would propose always keeping a static Elastic container that is reused by all tests, and configuring a unique prefix for each test so the tests are isolated (see the sketch after the list below).

  • ElasticClientIT: tests the integration between the client and an Elastic instance. I think this is useful, as it has an even narrower scope than ElasticsearchExporterIT: you test just the client and the database.
  • ElasticsearchExporterIT: tests the integration between the exporter and Elastic. It doesn't test every configuration option, etc., but just focuses on making sure that whatever is exported can be read back as expected, or that if we create the component template, it exists (but it won't check its contents; that's covered by the client tests, for example).
  • FaultTolerantIT: this is a standalone test class because it needs to start Elastic later in the test, so it cannot be packaged with the others. I can imagine we could add more tests here, but I just refactored the existing one. I don't really see how we could fold it into the other IT class, unless we enforce some ordering, which I'd rather avoid.
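
To make the prefix-based isolation concrete, here is a minimal sketch. The configuration field names are assumptions based on the exporter's configuration class, not verified against it.

```java
import java.util.UUID;

final class ExporterITSetup {
  // Sketch: one static container shared by all tests, one unique index prefix
  // per test, so tests never see each other's indices or templates.
  static ElasticsearchExporterConfiguration isolatedConfig(final String elasticUrl) {
    final var config = new ElasticsearchExporterConfiguration();
    config.url = elasticUrl; // e.g. CONTAINER.getHttpHostAddress() from the shared container
    config.index.prefix = "zeebe-test-" + UUID.randomUUID();
    return config;
  }
}
```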

Related issues

related to #

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug, then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; in case that fails, you need to create the backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. supports a new BPMN element), then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

npepinpe (Member Author) commented Apr 29, 2022

I'm confused as to why the code quality check fails. I did have to change the Checkstyle configuration, but locally it passes fine. Is the CI check not running with the latest configuration? Is it maybe pulling the configuration from the snapshots repository again instead of reading it from the local one?

Yup, that's the one: we only validate but don't install, so the changes to build-tools aren't present in later modules. Unfortunately, if we switch to install, we will also run the checks bound to verify, which we don't want to do here. Any ideas @oleschoenburg or @remcowesterhoud?

npepinpe (Member Author) commented Apr 29, 2022

One nice outcome: the exporter tests used to take 8 minutes to run, and now take a little less than 3 minutes (half of which is just building the module) :)

And I think I actually increased the test coverage, so...

npepinpe (Member Author) commented

Another GHA issue: it seems the unit tests don't find the new module I added. Why? 🤔

lenaschoenburg (Member) commented

> it seems the unit tests don't find the new module I added

Can you elaborate on which module you mean? I'm seeing that the new zeebe-exporter-test module is found: https://github.com/camunda/zeebe/runs/6232882989?check_suite_focus=true

lenaschoenburg (Member) commented

Oh dear, it looks like the Maven command to list all projects failed with an error, and then the error message was used as input to the build matrix. For example: https://github.com/camunda/zeebe/runs/6232883105?check_suite_focus=true

That's not great! We should definitely fail the project list step if maven encounters errors!

npepinpe (Member Author) commented May 2, 2022

Ah, so it found it, but then failed on some other project because the dependency was not installed? 🤔

lenaschoenburg (Member) commented

The only failed jobs are: the Java code formatting job, which is caused by what you described here: #9258 (comment); a couple of jobs that were spawned with invalid input due to this: #9258 (comment);

and two legitimate failures, here: https://github.com/camunda/zeebe/runs/6232882455?check_suite_focus=true and here: https://github.com/camunda/zeebe/runs/6232874252?check_suite_focus=true

I'll fix the issue with the invalid project list. For the Java code formatting job, I think installing the build-tools module first would work and makes sense as a solution 👍 Do you want to do this in this PR?

npepinpe (Member Author) commented May 2, 2022

Let's do it here, and if it works I'll cherry-pick the commit into a separate PR.

ghost pushed a commit that referenced this pull request May 2, 2022
9262: Install build-tools before validating r=npepinpe a=npepinpe

## Description

This PR ensures we install the build-tools module before validating the code base. This is important since the build-tools module contains configuration files for various tools used to verify our code base. Without installing it, these changes are not propagated to the downstream modules.

## Related issues

related to #9258



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
npepinpe (Member Author) commented May 2, 2022

I can't think of a nice solution for the matrix project other than installing everything, but that's less than optimal 😅

Since this is, what, the third or fourth time we've had issues with things not being installed, what if we did the following:

  1. The first job is build. It builds a clean repo into a local .m2 repository, such that everything we need is contained in the same workspace. Kind of like with Jenkins, though we don't need to worry too much about going offline or the like.
  2. All other jobs depend on build and download the folder, which contains its own .m2 repository.

This would help ensure dependencies from the same monorepo are always up to date, and it might cut down a bit on the time the other jobs spend building again. WDYT?

lenaschoenburg (Member) commented

Yeah, I've been playing around with a few variants and came to the same conclusion. I think we can use actions/cache here.

npepinpe (Member Author) commented May 2, 2022

I tried with upload/download artifact, but oh boy, it takes ~6 minutes to upload the artifact 😅

lenaschoenburg (Member) commented

Ouch :D Maybe we can restrict it to the io/camunda/ path? And maybe actions/cache is a bit faster?

npepinpe (Member Author) commented May 2, 2022

Oof, it's actually still running, so I guess uploading it is a no-go (https://github.com/camunda/zeebe/runs/6257745562?check_suite_focus=true)

I don't know. Maybe it's also because I'm using a self-hosted runner? Would it be faster to upload (though slower to build) on a GitHub-hosted runner? 🤔

lenaschoenburg (Member) commented

Alternatively, we could just go back to finding maven modules by searching for pom.xml files. It'd be a bit faster and wouldn't require a mvn install. WDYT?

npepinpe (Member Author) commented May 2, 2022

How was that done: parsing the root POM file for the directory names, or just grabbing all pom.xml files? It does seem a lot more brittle, unfortunately 😞

lenaschoenburg (Member) commented

It was grabbing all pom.xml files. I agree that asking Maven for a list would be better, but I don't really see how collecting a list of pom.xml files would be more brittle. Every module must always have a pom.xml file, and finding them shouldn't ever fail, right?

npepinpe (Member Author) commented May 2, 2022

We might have POM files which are not part of the root/main module and so shouldn't be picked up. That's the only thing I can imagine 🤷‍♂️

npepinpe (Member Author) commented May 2, 2022

If you tar the archive first, it's considerably faster: it takes < 2 minutes when the whole thing is tar'd, instead of who knows how long (after 10 minutes it was only at 35% 😄).

npepinpe (Member Author) commented May 2, 2022

OK, so I fixed it for the unit tests, but I guess I failed to grab the correct target/ sub-folders, since the distribution wasn't there, which caused all stages that build Docker images to fail. I wonder how I could figure out what was actually tar'd 🤔

npepinpe (Member Author) commented

Closing as this has been split and incorporated into other PRs

@npepinpe npepinpe closed this May 14, 2022
@npepinpe npepinpe deleted the np-es-spike branch July 25, 2022 10:06
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
Labels: version:8.1.0-alpha2, version:8.1.0