@danrosher
Contributor

@danrosher danrosher commented Jul 11, 2022

https://issues.apache.org/jira/browse/SOLR-16286

Description

getCheckpoints() honors initialCheckpoint, but getPersistedCheckpoints, which is processed first once checkpoints have been stored, does not. As a result, initialCheckpoint=0 does not work as expected after the checkpoints are stored.

Solution

Modify getPersistedCheckpoints to honor initialCheckpoint, as getCheckpoints already does.
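
To make the intended precedence concrete, here is a minimal sketch of how the starting checkpoint for a shard would be resolved after this change. It is a simplified model, not the actual TopicStream code: the class name, the `resolve` method, and the `-1` "not set" sentinel are assumptions made for illustration only.

```java
import java.util.OptionalLong;

/** Simplified model of checkpoint resolution; NOT the actual TopicStream implementation. */
class CheckpointResolutionSketch {

  /** Assumed sentinel meaning "initialCheckpoint was not supplied". */
  static final long UNSET = -1L;

  /**
   * Picks the version a shard should start reading from.
   *
   * @param initialCheckpoint the initialCheckpoint parameter, or UNSET if absent
   * @param persisted the checkpoint previously stored for this shard, if any
   * @param highestVersionInIndex the highest version currently in the index
   */
  static long resolve(long initialCheckpoint, OptionalLong persisted, long highestVersionInIndex) {
    // 1. An explicit initialCheckpoint wins, even when a checkpoint has been persisted.
    if (initialCheckpoint > UNSET) {
      return initialCheckpoint;
    }
    // 2. Otherwise resume from the persisted checkpoint, if one exists.
    if (persisted.isPresent()) {
      return persisted.getAsLong();
    }
    // 3. Otherwise start from the highest version in the index.
    return highestVersionInIndex;
  }

  public static void main(String[] args) {
    // Hypothetical values: persisted checkpoint 500, highest version 900.
    System.out.println(resolve(0L, OptionalLong.of(500L), 900L));
    System.out.println(resolve(UNSET, OptionalLong.of(500L), 900L));
    System.out.println(resolve(UNSET, OptionalLong.empty(), 900L));
  }
}
```

Before this change, step 1 was effectively skipped whenever a persisted checkpoint existed, which is why initialCheckpoint=0 stopped working after the first persisted run.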

Tests

Added StreamExpressionTest.testTopicStreamInitialCheckpoint, which runs a topic stream expression, persists it, re-runs it to ensure initialCheckpoint=0 is still honored, and then runs it again without initialCheckpoint=0 to ensure zero results are returned because the topic is up to date.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@danrosher
Contributor Author

./gradlew check passed except:

Crawl/parse...

Verify...

.../solr/solr/documentation/build/site/modules/jwt-auth/org/apache/solr/handler/admin/api/ModifyJWTAuthPluginConfigAPI.html
BROKEN LINK: .../solr/solr/documentation/build/site/core/org/apache/solr/handler/admin/api/JWTConfigurationPayload.html
BROKEN LINK: .../solr/solr/documentation/build/site/core/org/apache/solr/handler/admin/api/JWTConfigurationPayload.html

Broken javadocs links were found! Common root causes:

  • A typo of some sort for manually created links.
  • Public methods referencing non-public classes in their signature.

Unsure why this failed, though.

Contributor

@cpoerschke cpoerschke left a comment

Thanks @danrosher for opening this PR!

@cpoerschke
Contributor

Broken javadocs links were found! Common root causes:

  • A typo of some sort for manually created links.
  • Public methods referencing non-public classes in their signature.

unsure why this failed though

#944 sounds like it might be related.

danrosher and others added 2 commits July 20, 2022 11:12
…eamExpressionTest.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…eamExpressionTest.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
Comment on lines +4373 to +4382
expr =
"classify("
+
// use cacheMillis=0 to prevent cached results. it doesn't matter on the first run,
// but we want to ensure that when we re-use this expression later after
// training another model, we'll still get accurate results.
"model(modelCollection, id=\"model\", cacheMillis=0),"
+ "topic(checkpointCollection, uknownCollection, q=\"*:*\", fl=\"text_s, id\", id=\"1000000\"),"
+ "field=\"text_s\","
+ "analyzerField=\"tv_text\")";
Contributor

So this is the same expression as above but without initialCheckpoint=0. Reading the (current) docs, that means "the highest version in the index", but if the highest version in the index were used, wouldn't the first batch in the stream below exclude the documents just added with ids 2 and 3?

Wondering if the documentation needs tweaking to account for persisted checkpoints?

https://github.com/apache/solr/blob/releases/solr/9.0.0/solr/solr-ref-guide/modules/query-guide/pages/stream-source-reference.adoc#topic-parameters

Contributor Author

Checkpoints are persisted when the stream is closed, or, if checkpointEvery > -1, whenever count % checkpointEvery is 0; otherwise the checkpoints are held in the checkpoints hashmap. So for 'just' added docs, as long as they match the underlying query and have been soft committed (see the caveat for TopicStream in SOLR-8709), I think they should be picked up, unless I'm completely misunderstanding?
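
As a rough illustration of the persistence timing described in this comment, the sketch below records when a persist would be triggered. It is a toy model rather than the TopicStream code: the class, its fields, and the event strings are hypothetical, and it guards with `checkpointEvery > 0` (instead of `> -1`) purely so the sketch never divides by zero.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of when checkpoints get persisted; NOT the actual TopicStream implementation. */
class CheckpointPersistSketch {

  private final int checkpointEvery; // -1 means "only persist on close" in this sketch
  private int count = 0;             // number of tuples read so far

  /** Records each persist trigger so the timing can be inspected. */
  final List<String> persistEvents = new ArrayList<>();

  CheckpointPersistSketch(int checkpointEvery) {
    this.checkpointEvery = checkpointEvery;
  }

  /** Simulates reading one tuple; persists every checkpointEvery tuples. */
  void read() {
    count++;
    // Sketch-only guard: > 0 rather than > -1, to avoid a modulo by zero.
    if (checkpointEvery > 0 && count % checkpointEvery == 0) {
      persistEvents.add("read-" + count);
    }
  }

  /** Simulates closing the stream, which persists the in-memory checkpoints. */
  void close() {
    persistEvents.add("close");
  }

  public static void main(String[] args) {
    CheckpointPersistSketch stream = new CheckpointPersistSketch(2);
    for (int i = 0; i < 5; i++) {
      stream.read();
    }
    stream.close();
    System.out.println(stream.persistEvents);
  }
}
```

Between persist events, the checkpoints in this model exist only in memory, which corresponds to the "checkpoints hashmap" case mentioned above.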

@epugh
Contributor

epugh commented Aug 17, 2022

Thoughts @cpoerschke and @danrosher on this being ready for merging? Seems like a valuable bug fix! Or is there still a lack of clarity on the right behavior?

@danrosher
Contributor Author

danrosher commented Aug 17, 2022 via email

@cpoerschke
Contributor

... Or is there still a lack of clarity on the right behavior?

I think the new behaviour makes sense.

Am less sure w.r.t. the documentation and/or how to describe the changes in behaviour. @epugh - feel free to jump in if you wish.

https://github.com/apache/solr/blob/5a5989e5b6164091243dd29cfe327b5eaac2cfbd/solr/solr-ref-guide/modules/query-guide/pages/stream-source-reference.adoc#topic-parameters currently says "If not set, it defaults to the highest version in the index." -- is that the highest version of documents in the index or is it alluding to checkpoints somehow? Would something like "If not set, it defaults to previously established checkpoints (if any) or otherwise the highest version in the index." be accurate and user-friendly?

Contributor

@risdenk risdenk left a comment

This looks like a good bug fix. I'm not sure why it stalled. @epugh @danrosher @cpoerschke maybe just docs?

@epugh
Contributor

epugh commented Oct 27, 2022

I think there is some question about the docs. I like everything here ;-). I am going to add it to my "List of tickets to merge on Monday Oct 31st" so folks can weigh in ;-)

@epugh
Contributor

epugh commented Oct 31, 2022

@joel-bernstein I was going to merge this today, but I realized that might conflict with your work on separating out the streaming code. Should I wait? Do you want to add this to your list of tickets to merge once that work is done?

@github-actions

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the dev@solr.apache.org mailing list. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Feb 19, 2024
@github-actions

github-actions bot commented Oct 9, 2024

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

@github-actions github-actions bot added the closed-stale Closed after being stale for 60 days label Oct 9, 2024
@github-actions github-actions bot closed this Oct 9, 2024
4 participants