-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17050: Save models as compacted json #2030
SOLR-17050: Save models as compacted json #2030
Conversation
…l JsonStorage.indentSize
Hi Florin! Thanks for opening this pull request, with a test and scoped to only change the learning-to-rank model store. I've added a commit with minor tweaks, feel free to revert or amend. Taking a step back from the code, here's a couple of thoughts or questions:
What do you think? |
Would love to hear perspectives from others here too e.g. @alessandrobenedetti @epugh @janhoy or @tomglk perhaps? Thank you. |
Before proceeding with the code review, let me ask an additional question: Depending on the answers to these questions I would could be in favour of forcing the compact storage (if it's beneficial for performance significantly). |
I think that if it makes sense to compact the JSON, then just compact it. It's simple enought to pipe through |
One thing I meant to ask, why not use the PackageStore instead? It also (i believe!) would replicate a model around the cluster, the same way ZK does... |
Hello @cpoerschke, The fix is backwards compatible with old saved models. When an update (add/delete) to the model store is done, the new file will be saved as compacted JSON.
As for "Might some users prefer the existing behaviour i.e. more storage space but also more human readable?", only the file is stored as compacted JSON. If model-store endpoint will return a formatted JSON.
Yes, a consistency between LTR model-store and LTR feature-store will be helpful. I will do a commit soon. I do not think we should keep the both behaviours because this an improvement and not a new feature. The users should not be affected because the model-store API endpoint will return the same formatted JSON as before. |
Hello @alessandrobenedetti. The current implementation is affecting the off-heap Zookeeper memory size. Currently if we have 2 models each of 20MB and we want to add another one, zookeeper will use a lot of off-heap memory and it could go OOM (our container orchestrator will kill the container). Even if we have the file size limit jute.maxbuffer set to 536870912 and ZK heap to 4GB. The improvement in current PR will allow us to upload more models using less memory for ZK. |
I am not familiar with PackageStore, I will give it a look in the near future. If it works as you say, it could solve a lot of issues that can come with using big models and could be more stable than using DefaultWrapperModel. For the moment I think the implementation in this PR can give us a nice reward with little effort. |
You're right. I'd assumed users would 'look' directly in ZooKeeper and forgot that the endpoint can also return the model. |
we really don't want our users actually poking around in ZooKeeper... It really should be mediated by APIs and treated as an internal "implementation details"....! |
Hello @cpoerschke. I've done the same thing for feature store |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. A nice internal implementation detail improvement, and supporting PackageStore
use could be explored separately in future.
The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contribution looks good and it's interesting.
I left some minor comments that should be addressed in my opinion.
Not sure we want also some integration tests to verify we really store the stuff in Zookeeper as we want.
Not a blocker though,
Thanks!
public JsonStorage(StorageIO storageIO, SolrResourceLoader loader) { | ||
this(storageIO, loader, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this '2' is a default parameter I would prefer to see it as a constant, it would be more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Furthermore, who's calling this default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this '2' is a default parameter I would prefer to see it as a constant, it would be more readable
cc4a2d4 creates a JSONWriter.DEFAULT_INDENT
constant and uses it here.
Though I could also see that perhaps that's undesirable since it exposes JSONWriter
implementation detail here i.e. a constant elsewhere would avoid that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look at that commit, and it looks good to me! I don't see much problem with it!
assertEquals( | ||
"{\"k2\":\"v2\",\"k1\":{\"a\":\"b\",\"p\":\"r\",\"k21\":{\"xx\":\"yy\"}}}", | ||
new String(Utils.toJSON(object, -1), UTF_8)); | ||
String formatedJson = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatted?
@@ -294,4 +294,23 @@ public void testMergeJson() { | |||
assertEquals( | |||
2L, Utils.getObjectByPath(sink, true, List.of(DEFAULTS, COLLECTION_PROP, NRT_REPLICAS))); | |||
} | |||
|
|||
@SuppressWarnings({"unchecked"}) | |||
public void testToJson() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer 3 tests here:
edge case 0
edge case -1
standard case
I've never been a fan of mixing test cases within a single one (even if it's simple)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the required tests. Thanks for the suggestion!
…_SOLR-17050 Resolved Conflicts: solr/CHANGES.txt
solr/CHANGES.txt
Outdated
* SOLR-17050: Use compact JSON for Learning to Rank (LTR) feature and model storage. (Florin Babes, Christine Poerschke) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e369a6c undoes the CHANGES.txt entry addition temporarily, to be added back before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpoerschke I have to add the entry in CHANGES.txt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpoerschke I have to add the entry in CHANGES.txt ?
Thanks for asking! I've added the entry back in 02d166f after merging in the latest origin/main
-- CHANGES.txt more easily encounters merge conflicts and sometimes it is easier to defer the entry addition until closer to merge time.
@alessandrobenedetti I also added a test to check the storage in zookeeper. |
solr/modules/ltr/src/test/org/apache/solr/ltr/store/rest/TestModelManagerPersistence.java
Outdated
Show resolved
Hide resolved
…e compactness is tested Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…abes/solr into save_models_compacted_SOLR-17050
I did an update for a failed tidy check. Also @cpoerschke I did a commit with your suggestion. |
Filed https://issues.apache.org/jira/browse/SOLR-17081 for visibility. |
…_SOLR-17050 Resolved Conflicts: solr/CHANGES.txt
If there are no further comments or concerns or objections then I'll aim to merge this PR early next week. Thanks @holysleeper for this contribution! |
https://issues.apache.org/jira/browse/SOLR-17050
Description
Currenly we have a limit for how big a model can be when uploaded to solr. That is because of zookeeper file size limitations. Models are now stored in prettified json with an indent of 2 spaces.
Solution
Store the model as compacted json. For my use case a model dropped from 27MB to 8.8MB by saving it as compacted json
Tests
Added test for Utils.toJson
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.