SOLR-16812: Support CBOR format for update/query #1655
noblepaul merged 16 commits into apache:main
Conversation
solr/solrj/src/java/org/apache/solr/client/solrj/request/GenericSolrRequest.java (review thread resolved)
dsmiley left a comment:
Cool to see this new format!
```java
import org.apache.solr.handler.loader.CborStream;
import org.apache.solr.response.XMLResponseWriter;

public class TestCborDataFormat extends SolrCloudTestCase {
```
IMO our request/response formats, including this new one, should be tested in a general way rather than with a dedicated test, and especially not a SolrCloud test. For example, we already randomly pick compatible substitutes for the codec. Shouldn't we do the same for the underlying "wt" encoding too?
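(Not part of the original review comment, but as a rough sketch of what randomizing the "wt" encoding could look like. The `RandomWtPicker` class and `pickWt` method are hypothetical, not existing Solr test infrastructure; only the format names come from the discussion.)

```java
import java.util.List;
import java.util.Random;

// Hypothetical helper, not existing Solr test code: pick the response/update
// format at random so shared request/response tests exercise every supported
// "wt", including the new cbor format.
public class RandomWtPicker {
  private static final List<String> FORMATS = List.of("javabin", "json", "xml", "cbor");

  public static String pickWt(Random random) {
    // e.g. called from a test base class so each run covers a different format
    return FORMATS.get(random.nextInt(FORMATS.size()));
  }
}
```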
solr/core/src/java/org/apache/solr/handler/loader/CborStream.java (outdated; review thread resolved)
solr/core/src/java/org/apache/solr/handler/UpdateRequestHandler.java (outdated; review thread resolved)
gerlowskija left a comment:
The CBOR serialization code itself looks OK, but the test/benchmarking code could really use some improvement IMO.
I'm also a little concerned about what's not in this PR. In particular: there's no documentation anywhere for the new format. And, less crucially, no deprecation of javabin (or guidance on when users should use javabin vs. cbor).
```java
byte[] b =
    Files.readAllBytes(
        new File(ExternalPaths.SOURCE_HOME, "example/films/films.json").toPath());
```
[Q] Does the films dataset have sufficient variety of field types to fully validate the cbor serialization/deserialization logic? Is there testing elsewhere that would make sure that cbor serialization doesn't blow up when presented with e.g. a 'binary' or 'pint' field?
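(Illustrative only, not from the review: a document along these lines could exercise field types the films data set lacks. The field suffixes assume a test schema where `*_i` maps to a pint field and `*_bin` to a binary field type.)

```java
import org.apache.solr.common.SolrInputDocument;

// Sketch, not from the PR: builds a document covering field types films.json
// does not, assuming dynamic fields *_i (pint) and *_bin (binary) exist in the schema.
public class TypeCoverageDocs {
  public static SolrInputDocument binaryAndPintDoc() {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "type-coverage-1");
    doc.addField("count_i", 42);                          // pint
    doc.addField("payload_bin", new byte[] {0x01, 0x02}); // binary
    return doc;
  }
}
```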
```java
byte[] b =
    Files.readAllBytes(
        new File(ExternalPaths.SOURCE_HOME, "example/films/films.json").toPath());
// every operation is performed twice. We should only take the second number
```
[-0] Is this class a test meant to validate correctness, or a benchmark for measuring performance? I think both are needed and valuable; it's great you thought of both. But I think it's a mistake to combine the two in a single class. Especially when we have a whole "benchmark" module built for exactly the sort of measurements you're taking. It offers tools (e.g. JMH annotations, data generators) that make it easy to test performance much more robustly than is feasible to do here. (See my comment below).
[-1] It's a good start, but I think there are a handful of problems with how this JUnit method gathers numbers around CBOR queries that make it misleading even for anecdotal performance data.
I'm sure you've done other benchmarking at this point to prove out the CBOR idea, but so far these are the only numbers the rest of the community has seen. So we need to make sure they're rock solid! In particular, here are a few of my concerns:
- Even for tiny methods called in a tight loop, JVMs often take more than n=2 iterations to optimize bytecode. The 'N' here is just too small to assume the JVM is fully "hot" on the ser-de code for each format. Even ignoring JVM optimizations, n=2 is too small to rule out random noise (other processes competing for CPU, JVM GC, etc.) on the host machine.
- The order in which you're testing formats likely skews results. e.g. The two JSON queries run on an almost entirely cold JVM, whereas CBOR queries run when the JVM, OS caches, Solr caches, etc. are considerably warmer.
- "films.json" is pretty small as a data set and homogenous in terms of a few field-types. I worry that would make the gathered data unrepresentative and noisy.
I think all three of these would be easy to address if the code was moved to the benchmarking module. It has data generators for building larger and more diverse datasets. JMH has a lot of support built in that make it easy to configure "Warming" iterations that get ignored in the collected statistics. etc.
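(Not from the review itself, but a rough sketch of how such a benchmark might look in the benchmark module. The JMH annotations are the real JMH API; the `QueryRunner` helper and its `runQuery(String wt)` method are placeholders for whatever client/collection setup the actual benchmark would use.)

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

// Sketch only: compare query round-trips per "wt" format with JMH, so warm-up
// iterations are excluded from the reported statistics and formats are not
// skewed by test ordering.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)
@Warmup(iterations = 5)       // warm-up runs, ignored in the reported numbers
@Measurement(iterations = 10) // measured runs, aggregated by JMH
@State(Scope.Benchmark)
public class CborQueryBenchmark {

  @Param({"json", "javabin", "cbor"})
  String wt;

  @Benchmark
  public byte[] query() throws Exception {
    // Placeholder: issue the same query with the parameterized wt and return the raw bytes.
    return QueryRunner.runQuery(wt);
  }
}
```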
```java
request.setResponseParser(new InputStreamResponseParser(wt));
result = client.request(request, testCollection);
byte[] b = copyStream((InputStream) result.get("stream"));
System.out.println(wt + "_time : " + timer.getTime());
```
[Q] So, as I read this code, runQuery will capture CBOR serialization in its perf numbers but not deserialization, since we're not deserializing the received data on the client side.
Am I reading that right? And if so, is that intentional?
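(For illustration, not part of the review: if client-side deserialization should also be measured, the raw bytes could be decoded before stopping the timer, e.g. with Jackson's CBOR dataformat module, assuming jackson-dataformat-cbor is on the classpath.)

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import java.util.Map;

// Sketch only: decode the raw CBOR response so deserialization cost is included
// in the measurement. Assumes jackson-dataformat-cbor is available on the classpath.
public class CborDecodeSketch {
  private static final ObjectMapper CBOR_MAPPER = new ObjectMapper(new CBORFactory());

  @SuppressWarnings("unchecked")
  public static Map<String, Object> decode(byte[] cborBytes) throws Exception {
    return CBOR_MAPPER.readValue(cborBytes, Map.class);
  }
}
```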
[0] Also, doesn't forbiddenApis complain about the System.out.println usage here? I swear that was one of the things we disallowed, though maybe we're looser about that in our tests...