
SOLR-16812: Support CBOR format for update/query #1655

Merged
noblepaul merged 16 commits into apache:main from cowpaths:noble/wt_cbor
Jun 5, 2023

Conversation

@noblepaul

No description provided.


@dsmiley left a comment


Cool to see this new format!

import org.apache.solr.handler.loader.CborStream;
import org.apache.solr.response.XMLResponseWriter;

public class TestCborDataFormat extends SolrCloudTestCase {

IMO our request/response formats, to include this new one, should be tested in a general way without a dedicated test, especially not a SolrCloud test. For example, we randomly pick compatible substitutes for the codec. Shouldn't we do the same for the underlying "wt" encoding too?
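The randomization suggested here could look something like the sketch below: each test run picks one of the compatible "wt" formats at random, the way Solr tests already randomize the codec. This is an illustrative stdlib-only sketch; `pickWt`, `RandomWt`, and the format list are made up for this example and are not Solr test-framework API.

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch of randomizing the "wt" encoding across tests,
// analogous to how compatible codec substitutes are picked at random.
// Names here are illustrative, not actual Solr test APIs.
public class RandomWt {
    private static final List<String> COMPATIBLE_WTS =
            List.of("javabin", "json", "xml", "cbor");

    static String pickWt(Random rnd) {
        return COMPATIBLE_WTS.get(rnd.nextInt(COMPATIBLE_WTS.size()));
    }

    public static void main(String[] args) {
        // Each run exercises a randomly chosen, interchangeable format.
        System.out.println("wt=" + pickWt(new Random()));
    }
}
```

With this in place, every existing request/response test would exercise CBOR some fraction of the time, instead of relying on one dedicated test class.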

@noblepaul noblepaul merged commit faf5d3c into apache:main Jun 5, 2023

@gerlowskija left a comment


The CBOR serialization code itself looks OK, but the test/benchmarking code could really use some improvement IMO.

I'm also a little concerned about what's not in this PR. In particular: there's no documentation anywhere for the new format. And, less crucially, no deprecation plan for javabin (or guidance on when users should use javabin vs. cbor).


byte[] b =
Files.readAllBytes(
new File(ExternalPaths.SOURCE_HOME, "example/films/films.json").toPath());

[Q] Does the films dataset have a sufficient variety of field types to fully validate the cbor serialization/deserialization logic? Is there testing elsewhere that would make sure that cbor serialization doesn't blow up when presented with e.g. a 'binary' or 'pint' field?
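One way to picture the coverage gap raised here: a single test document carrying the field types the films data lacks (binary, integer/float numerics, bool, multivalued) that any wt serializer could be exercised against. The field names and this `DiverseDoc` class are made up for illustration; a real test would build a `SolrInputDocument` against the schema's actual field definitions.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a document with the value types films.json
// doesn't cover, so serialization round-trips can be checked for each.
public class DiverseDoc {
    static Map<String, Object> sample() {
        return Map.of(
                "id", "doc-1",
                "payload_bin", new byte[] {0x00, (byte) 0xFF, 0x7F}, // 'binary'-style field
                "count_pint", 42,                                    // 'pint'-style field
                "score_pfloat", 3.14f,
                "active_b", true,
                "tags_ss", List.of("a", "b", "c"));                  // multivalued field
    }

    public static void main(String[] args) {
        System.out.println(sample().keySet());
    }
}
```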

byte[] b =
Files.readAllBytes(
new File(ExternalPaths.SOURCE_HOME, "example/films/films.json").toPath());
// every operation is performed twice. We should only take the second number

[-0] Is this class a test meant to validate correctness, or a benchmark for measuring performance? I think both are needed and valuable; it's great you thought of both. But I think it's a mistake to combine the two in a single class. Especially when we have a whole "benchmark" module built for exactly the sort of measurements you're taking. It offers tools (e.g. JMH annotations, data generators) that make it easy to test performance much more robustly than is feasible to do here. (See my comment below).

[-1] It's a good start, but I think there's a handful of problems with how this JUnit method gathers numbers around CBOR queries, that make it misleading for gathering even anecdotal performance data.

I'm sure you've done other benchmarking at this point to prove out the CBOR idea, but so far these are the only numbers the rest of the community has seen. So we need to make sure they're rock solid! In particular, here are a few of my concerns:

  1. Even for tiny methods called in a tight loop, JVMs often take more than n=2 iterations to optimize bytecode. The 'N' here is just too small to assume the JVM is fully "hot" on the ser-de code for each format. Even ignoring JVM optimizations, n=2 is too small to rule out random noise (other processes competing for CPU, JVM GC, etc.) on the host machine.
  2. The order in which you're testing formats likely skews results. e.g. The two JSON queries run on an almost entirely cold JVM, whereas CBOR queries run when the JVM, OS caches, Solr caches, etc. are considerably warmer.
  3. "films.json" is pretty small as a data set and homogeneous, covering only a few field types. I worry that would make the gathered data unrepresentative and noisy.

I think all three of these would be easy to address if the code was moved to the benchmarking module. It has data generators for building larger and more diverse datasets, and JMH has a lot of built-in support that makes it easy to configure "warmup" iterations that get ignored in the collected statistics, etc.
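The measurement discipline described in points 1 and 2 can be sketched in plain Java, assuming nothing about the benchmark module: run warmup iterations that are discarded (so the JIT is hot before measuring), then report the median of many measured iterations rather than "the second number". JMH automates all of this; the `MiniBench` class and its method names below are illustrative only.

```java
import java.util.Arrays;

// Plain-Java sketch of warmup-then-measure with a median summary.
// JMH does this (and much more) properly; this only illustrates the idea.
public class MiniBench {
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    static long measureMedianNanos(Runnable task, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) task.run(); // discarded: lets the JIT optimize
        long[] samples = new long[measured];
        for (int i = 0; i < measured; i++) {
            long t0 = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - t0;     // many samples, not n=2
        }
        return median(samples);                      // robust to GC/noise outliers
    }

    public static void main(String[] args) {
        long med = measureMedianNanos(() -> Math.sqrt(12345.0), 1_000, 101);
        System.out.println("median ns: " + med);
    }
}
```

Note this still doesn't fix point 2 (format ordering) by itself; interleaving or randomizing the order in which formats are measured is also needed, which JMH's per-benchmark forked JVMs give you for free.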

request.setResponseParser(new InputStreamResponseParser(wt));
result = client.request(request, testCollection);
byte[] b = copyStream((InputStream) result.get("stream"));
System.out.println(wt + "_time : " + timer.getTime());

[Q] So, as I read this code - runQuery will capture CBOR serialization in its perf numbers but not deserialization since we're not deserializing the received data on the client side.

Am I reading that right? And if so, is that intentional?

[0] Also, doesn't forbiddenApis complain about the System.out.println usage here? I swear that was one of the things we disallowed, though maybe we're looser about that in our tests...
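If deserialization is meant to be covered, the timed region has to both drain the stream and decode the bytes. A hedged stdlib-only sketch of that shape is below: `copyStream` mirrors what the test already does, while `decode` is a placeholder for whatever parser matches the wt (e.g. a CBOR parser on the client); the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Sketch: include client-side deserialization inside the timed region,
// not just the raw stream copy. "decode" stands in for a real wt parser.
public class TimedReadAndDecode {
    static byte[] copyStream(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static long timeReadAndDecode(InputStream in, Consumer<byte[]> decode) {
        long t0 = System.nanoTime();
        byte[] bytes = copyStream(in); // server serialization + transfer side
        decode.accept(bytes);          // client deserialization, now inside the timer
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes();
        long ns = timeReadAndDecode(new ByteArrayInputStream(payload), b -> {});
        System.out.println("elapsed ns: " + ns);
    }
}
```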

noblepaul added a commit to cowpaths/fullstory-solr that referenced this pull request Jun 12, 2023
hiteshk25 pushed a commit to cowpaths/fullstory-solr that referenced this pull request Jul 11, 2023
4 participants