SOLR-17050: Save models as compacted json #2030

Merged
merged 21 commits into from
Dec 4, 2023

Conversation

babesflorin
Contributor

https://issues.apache.org/jira/browse/SOLR-17050

Description

Currently there is a limit on how big a model can be when uploaded to Solr, because of ZooKeeper file size limitations. Models are currently stored as prettified JSON with an indent of 2 spaces.

Solution

Store the model as compacted JSON. For my use case, a model dropped from 27MB to 8.8MB when saved as compacted JSON.
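
To illustrate where the saving comes from, here is a minimal, self-contained sketch. It is not Solr's `Utils.toJSON` implementation: the class `JsonSizeSketch`, its `toJson` helper and the made-up tree-like model are all hypothetical. It serializes the same nested structure once with a 2-space indent and once with no whitespace, then compares byte counts. The ratio depends on how deeply nested the data is, so the numbers will differ from the 27MB/8.8MB case above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsonSizeSketch {

  /** Hypothetical helper: indent < 0 writes compact JSON, otherwise one element per line. */
  public static String toJson(Object value, int indent) {
    StringBuilder out = new StringBuilder();
    write(value, indent, 0, out);
    return out.toString();
  }

  private static void write(Object value, int indent, int depth, StringBuilder out) {
    if (value instanceof Map) {
      out.append('{');
      boolean first = true;
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        if (!first) out.append(',');
        first = false;
        newline(indent, depth + 1, out);
        quote(String.valueOf(e.getKey()), out);
        out.append(':');
        write(e.getValue(), indent, depth + 1, out);
      }
      if (!first) newline(indent, depth, out);
      out.append('}');
    } else if (value instanceof List) {
      out.append('[');
      boolean first = true;
      for (Object item : (List<?>) value) {
        if (!first) out.append(',');
        first = false;
        newline(indent, depth + 1, out);
        write(item, indent, depth + 1, out);
      }
      if (!first) newline(indent, depth, out);
      out.append(']');
    } else if (value instanceof String) {
      quote((String) value, out);
    } else {
      out.append(value); // numbers, booleans and null
    }
  }

  /** Emits a line break plus indentation, or nothing at all in compact mode. */
  private static void newline(int indent, int depth, StringBuilder out) {
    if (indent < 0) return;
    out.append('\n');
    for (int i = 0; i < indent * depth; i++) out.append(' ');
  }

  /** Quotes a string, escaping only quotes and backslashes (enough for this sketch). */
  private static void quote(String s, StringBuilder out) {
    out.append('"');
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '"' || c == '\\') out.append('\\');
      out.append(c);
    }
    out.append('"');
  }

  public static void main(String[] args) {
    // Made-up data: many small nested objects, loosely resembling a large tree-based model.
    List<Object> trees = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      Map<String, Object> leaf = new LinkedHashMap<>();
      leaf.put("value", i * 0.5);
      Map<String, Object> node = new LinkedHashMap<>();
      node.put("feature", "f" + i);
      node.put("threshold", 0.5);
      node.put("left", leaf);
      node.put("right", leaf);
      trees.add(node);
    }
    Map<String, Object> model = new LinkedHashMap<>();
    model.put("name", "example");
    model.put("trees", trees);

    int pretty = toJson(model, 2).getBytes(StandardCharsets.UTF_8).length;
    int compact = toJson(model, -1).getBytes(StandardCharsets.UTF_8).length;
    System.out.println("indent=2: " + pretty + " bytes, compact: " + compact + " bytes");
  }
}
```

Every nesting level adds a newline and `indent * depth` spaces per element in the indented form, which is why deeply nested models shrink the most when compacted.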

Tests

Added a test for Utils.toJSON.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@cpoerschke
Contributor

Hi Florin!

Thanks for opening this pull request, with a test and scoped to only change the learning-to-rank model store.

I've added a commit with minor tweaks, feel free to revert or amend.

Taking a step back from the code, here's a couple of thoughts or questions:

  • Do we need to consider backwards compatibility i.e. existing storage behaviour changing and/or how to communicate the change in the upgrade notes?

  • Might some users prefer the existing behaviour i.e. more storage space but also more human readable?

  • Would consistency between the LTR model store and the LTR feature store be helpful?

  • Might it be possible to support both the existing and the new behaviour?

    • A system property e.g. -Dsolr.ltr.model-store.indentSize=2 and -Dsolr.ltr.feature-store.indentSize=2 perhaps?
    • A configuration option of some kind, but how?
    • A parameter that is passed during upload e.g. different models could have different indentation?
    • An alternative endpoint for uploading?
    • Something else?

What do you think?

@cpoerschke
Contributor

Would love to hear perspectives from others here too e.g. @alessandrobenedetti @epugh @janhoy or @tomglk perhaps? Thank you.

@alessandrobenedetti
Contributor

Before proceeding with the code review, let me ask an additional question:
How is the size affecting the current implementation?
Zookeeper on heap/off-heap memory? Zookeeper disk?
Apache Solr on heap/off-heap memory?
Apache Solr query time performance?

Depending on the answers to these questions, I could be in favour of forcing compact storage (if it significantly benefits performance).
In terms of the configurability discussion, I believe it could be an over-complication. If it's for human readability, I suspect it would be quite quick for a human to get the JSON and paste it into an online editor such as jsonlint.com.
Let's also remember that the limit in ZooKeeper is currently configurable, but I agree it often causes headaches.
And it goes without saying, thank you very much for the contribution @holysleeper and thanks @cpoerschke for the heads up!

@epugh
Contributor

epugh commented Oct 27, 2023

I think that if it makes sense to compact the JSON, then just compact it. It's simple enough to pipe through jq, or we could add formatting options on the way out.

@epugh
Contributor

epugh commented Oct 27, 2023

One thing I meant to ask: why not use the PackageStore instead? It also (I believe!) would replicate a model around the cluster, the same way ZK does...

@babesflorin
Contributor Author

babesflorin commented Oct 29, 2023

Hello @cpoerschke,
Thanks for the fix.

The fix is backwards compatible with old saved models. When an update (add/delete) to the model store is done, the new file will be saved as compacted JSON.

curl http://localhost:8983/solr/small_models/admin/file\?wt\=json\&_\=1698594868357\&file\=_schema_model-store.json\&contentType\=application%2Fjson%3Bcharset%3Dutf-8
{
"initArgs":{},
"initializedOn":"2023-10-29T15:25:29.481Z",
"managedList":[{
"name":"6029760550880411648",
"class":"org.apache.solr.ltr.model.LinearModel",
"store":"DEFAULT",
"features":[
{
"name":"title",
"norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},
{
"name":"description",
"norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},
{
"name":"keywords",
"norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},
{
"name":"popularity",
"norm":{
"class":"org.apache.solr.ltr.norm.MinMaxNormalizer",
"params":{
"min":"0.0",
"max":"10.0"}}},
{
"name":"text",
"norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},
{
"name":"queryIntentPerson",
"norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},
{
"name":"queryIntentCompany",
"norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],
"params":{"weights":{
"title":0.0,
"description":0.1,
"keywords":0.2,
"popularity":0.3,
"text":0.4,
"queryIntentPerson":0.1231231,
"queryIntentCompany":0.12121211}}}]}%
➜ ~ curl -XPUT 'http://localhost:8983/solr/small_models/schema/model-store' --data-binary "@/tmp/linear-model-new.json" -H 'Content-type:application/json'
{
"responseHeader":{
"status":0,
"QTime":6
}
}%
➜ ~ curl http://localhost:8983/solr/small_models/admin/file\?wt\=json\&_\=1698594868357\&file\=_schema_model-store.json\&contentType\=application%2Fjson%3Bcharset%3Dutf-8
{"initArgs":{},"initializedOn":"2023-10-29T15:44:47.459Z","updatedSinceInit":"2023-10-29T15:45:15.689Z","managedList":[{"name":"6029760550880411648-after-fix","class":"org.apache.solr.ltr.model.LinearModel","store":"DEFAULT","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}},{"name":"6029760550880411648","class":"org.apache.solr.ltr.model.LinearModel","store":"DEFAULT","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}}]}%
➜ ~ curl http://localhost:8983/solr/small_models/select\?debugQuery\=false\&fl\=id%2C%20score%2C%5Bfeatures%5D\&indent\=true\&q.op\=OR\&q\=\*%3A\*\&rq\=%7B\!ltr%20model%3D6029760550880411648-after-fix%7D\&useParams\=
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":4,
"params":{
"df":"text",
"indent":"true",
"echoParams":"all",
"fl":"id, score,[features]",
"q.op":"OR",
"distrib.singlePass":"true",
"rows":"10",
"q":"*:*",
"shards.tolerant":"true",
"sow":"true",
"shards.preference":"replica.type:PULL,replica.location:local,replica.base:stable:hash",
"wt":"json",
"debugQuery":"false",
"useParams":"",
"rq":"{!ltr model=6029760550880411648-after-fix}",
"rid":"null-191"}},
"response":{"numFound":2,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
{
"id":"123",
"score":3.5116758,
"[features]":"title=1.0,description=2.0,keywords=2.0,popularity=3.0,text=4.0,queryIntentPerson=5.0,queryIntentCompany=5.0"},
{
"id":"1223",
"score":3.5116758,
"[features]":"title=1.0,description=2.0,keywords=2.0,popularity=3.0,text=4.0,queryIntentPerson=5.0,queryIntentCompany=5.0"}]
}}

As for "Might some users prefer the existing behaviour i.e. more storage space but also more human readable?": only the file is stored as compacted JSON. The model-store endpoint will still return formatted JSON.
Example:

~ curl http://localhost:8983/solr/small_models/admin/file\?wt\=json\&_\=1698594868357\&file\=_schema_model-store.json\&contentType\=application%2Fjson%3Bcharset%3Dutf-8
{"initArgs":{},"initializedOn":"2023-10-29T15:44:47.459Z","updatedSinceInit":"2023-10-29T16:06:58.021Z","managedList":[{"name":"6029760550880411648-after-fix","class":"org.apache.solr.ltr.model.LinearModel","store":"DEFAULT","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}},{"name":"6029760550880411648","class":"org.apache.solr.ltr.model.LinearModel","store":"DEFAULT","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}}]}%
➜ ~ curl http://localhost:8983/solr/small_models/schema/model-store
{
"responseHeader":{
"status":0,
"QTime":1
},
"models":[{
"name":"6029760550880411648-after-fix",
"class":"org.apache.solr.ltr.model.LinearModel",
"store":"DEFAULT",
"features":[{
"name":"title",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"description",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"keywords",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"popularity",
"norm":{
"class":"org.apache.solr.ltr.norm.MinMaxNormalizer",
"params":{
"min":"0.0",
"max":"10.0"
}
}
},{
"name":"text",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"queryIntentPerson",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"queryIntentCompany",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
}],
"params":{
"weights":{
"title":0.0,
"description":0.1,
"keywords":0.2,
"popularity":0.3,
"text":0.4,
"queryIntentPerson":0.1231231,
"queryIntentCompany":0.12121211
}
}
},{
"name":"6029760550880411648",
"class":"org.apache.solr.ltr.model.LinearModel",
"store":"DEFAULT",
"features":[{
"name":"title",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"description",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"keywords",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"popularity",
"norm":{
"class":"org.apache.solr.ltr.norm.MinMaxNormalizer",
"params":{
"min":"0.0",
"max":"10.0"
}
}
},{
"name":"text",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"queryIntentPerson",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
},{
"name":"queryIntentCompany",
"norm":{
"class":"org.apache.solr.ltr.norm.IdentityNormalizer"
}
}],
"params":{
"weights":{
"title":0.0,
"description":0.1,
"keywords":0.2,
"popularity":0.3,
"text":0.4,
"queryIntentPerson":0.1231231,
"queryIntentCompany":0.12121211
}
}
}]
}%

Yes, consistency between the LTR model store and the LTR feature store would be helpful. I will push a commit soon.

I do not think we should keep both behaviours, because this is an improvement and not a new feature. Users should not be affected because the model-store API endpoint will return the same formatted JSON as before.
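
As a rough illustration of why the stored form should not matter to API users: whitespace outside string literals carries no meaning in JSON, so the prettified and compacted forms describe the same data. The sketch below is a hypothetical helper, not code from this PR. It strips that insignificant whitespace and leaves string contents untouched, so a formatted document reduces to its compact equivalent.

```java
public class WhitespaceSketch {

  /** Hypothetical helper: removes whitespace outside string literals; strings are kept as-is. */
  public static String compact(String json) {
    StringBuilder out = new StringBuilder(json.length());
    boolean inString = false; // true while between an opening and closing double quote
    boolean escaped = false;  // true right after a backslash inside a string
    for (int i = 0; i < json.length(); i++) {
      char c = json.charAt(i);
      if (inString) {
        out.append(c);
        if (escaped) {
          escaped = false;
        } else if (c == '\\') {
          escaped = true;
        } else if (c == '"') {
          inString = false;
        }
      } else if (c == '"') {
        inString = true;
        out.append(c);
      } else if (c != ' ' && c != '\n' && c != '\r' && c != '\t') {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String formatted =
        "{\n  \"name\": \"linear model\",\n  \"weights\": {\n    \"title\": 0.0\n  }\n}";
    System.out.println(compact(formatted));
  }
}
```

The space inside "linear model" survives because it is part of a string value; only layout whitespace is dropped.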

@babesflorin
Contributor Author

Hello @alessandrobenedetti. The current implementation affects ZooKeeper's off-heap memory usage. Currently, if we have 2 models of 20MB each and we want to add another one, ZooKeeper will use a lot of off-heap memory and could go OOM (our container orchestrator will kill the container), even with the file size limit jute.maxbuffer set to 536870912 and the ZK heap set to 4GB.

The improvement in the current PR will allow us to upload more models while using less memory for ZK.

@babesflorin
Contributor Author

One thing I meant to ask, why not use the PackageStore instead? It also (i believe!) would replicate a model around the cluster, the same way ZK does...

I am not familiar with PackageStore; I will take a look at it in the near future. If it works as you say, it could solve a lot of the issues that come with using big models and could be more stable than using DefaultWrapperModel.

For the moment I think the implementation in this PR can give us a nice reward with little effort.

@cpoerschke
Contributor

... only the file is stored as compacted ... model-store endpoint will return a formatted JSON. ...

You're right. I'd assumed users would 'look' directly in ZooKeeper and forgot that the endpoint can also return the model.

@epugh
Contributor

epugh commented Nov 2, 2023

We really don't want our users actually poking around in ZooKeeper... It really should be mediated by APIs and treated as an internal "implementation detail"!

@babesflorin
Contributor Author

Hello @cpoerschke. I've done the same thing for the feature store.

Contributor

@cpoerschke cpoerschke left a comment

LGTM. A nice internal implementation detail improvement, and supporting PackageStore use could be explored separately in future.

@cpoerschke
Contributor

The solr:modules:extraction:test CI failure seems unrelated/pre-existing/environmental e.g. https://lists.apache.org/thread/fgmlyf8smtj8qyy14466bqfjhsqhqqjs -- cannot reproduce locally.

Contributor

@alessandrobenedetti alessandrobenedetti left a comment

The contribution looks good and it's interesting.
I left some minor comments that should be addressed, in my opinion.
I'm not sure whether we also want some integration tests to verify that we really store the data in ZooKeeper as intended.
Not a blocker though.
Thanks!

public JsonStorage(StorageIO storageIO, SolrResourceLoader loader) {
this(storageIO, loader, 2);
Contributor

If this '2' is a default parameter, I would prefer to see it as a constant; it would be more readable.

Contributor

Furthermore, who's calling this default?

Contributor

if this '2' is a default parameter I would prefer to see it as a constant, it would be more readable

cc4a2d4 creates a JSONWriter.DEFAULT_INDENT constant and uses it here.

Though I could also see that perhaps that's undesirable since it exposes JSONWriter implementation detail here i.e. a constant elsewhere would avoid that.

Contributor

Took a look at that commit, and it looks good to me! I don't see much problem with it!

assertEquals(
"{\"k2\":\"v2\",\"k1\":{\"a\":\"b\",\"p\":\"r\",\"k21\":{\"xx\":\"yy\"}}}",
new String(Utils.toJSON(object, -1), UTF_8));
String formatedJson =
Contributor

formatted?

@@ -294,4 +294,23 @@ public void testMergeJson() {
assertEquals(
2L, Utils.getObjectByPath(sink, true, List.of(DEFAULTS, COLLECTION_PROP, NRT_REPLICAS)));
}

@SuppressWarnings({"unchecked"})
public void testToJson() {
Contributor

I would prefer 3 tests here:
edge case 0
edge case -1
standard case

I've never been a fan of mixing test cases within a single one (even if it's simple)

Contributor Author

I added the requested tests. Thanks for the suggestion!

solr/CHANGES.txt Outdated
Comment on lines 99 to 100
* SOLR-17050: Use compact JSON for Learning to Rank (LTR) feature and model storage. (Florin Babes, Christine Poerschke)

Contributor

e369a6c undoes the CHANGES.txt entry addition temporarily, to be added back before merging.

Contributor Author

@cpoerschke Do I have to add the entry to CHANGES.txt?

Contributor

@cpoerschke Do I have to add the entry to CHANGES.txt?

Thanks for asking! I've added the entry back in 02d166f after merging in the latest origin/main -- CHANGES.txt more easily encounters merge conflicts and sometimes it is easier to defer the entry addition until closer to merge time.

@babesflorin
Contributor Author

The contribution looks good and it's interesting. I left some minor comments that should be addressed in my opinion. Not sure we want also some integration tests to verify we really store the stuff in Zookeeper as we want. Not a blocker though, Thanks!

@alessandrobenedetti I also added a test to check the storage in ZooKeeper.
@cpoerschke Thanks for adding the constant.

babesflorin and others added 3 commits November 20, 2023 14:44
…e compactness is tested

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…abes/solr into save_models_compacted_SOLR-17050
@babesflorin
Contributor Author

I pushed an update to fix a failed tidy check. Also, @cpoerschke, I committed your suggestion.

@cpoerschke
Contributor

The solr:modules:extraction:test CI failure seems unrelated/pre-existing/environmental e.g. https://lists.apache.org/thread/fgmlyf8smtj8qyy14466bqfjhsqhqqjs -- cannot reproduce locally.

Filed https://issues.apache.org/jira/browse/SOLR-17081 for visibility.

…_SOLR-17050

Resolved Conflicts:
	solr/CHANGES.txt
@cpoerschke
Contributor

If there are no further comments or concerns or objections then I'll aim to merge this PR early next week.

Thanks @holysleeper for this contribution!

@cpoerschke cpoerschke self-assigned this Dec 1, 2023
@cpoerschke cpoerschke merged commit ce99303 into apache:main Dec 4, 2023
3 checks passed
asfgit pushed a commit that referenced this pull request Dec 4, 2023
…odel storage. (#2030)

Co-authored-by: Florin Babes <florin.babes@emag.ro>
Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
(cherry picked from commit ce99303)