GEODE-8536: Allow limited retries when creating Lucene IndexWriter #5553
Conversation
@@ -44,6 +44,7 @@
   private static final Logger logger = LogService.getLogger();
   public static final String FILE_REGION_LOCK_FOR_BUCKET_ID = "FileRegionLockForBucketId:";
   public static final String APACHE_GEODE_INDEX_COMPLETE = "APACHE_GEODE_INDEX_COMPLETE";
+  protected static final int GET_INDEX_WRITER_MAX_ATTEMPTS = 10;
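The added constant suggests a bounded retry loop around IndexWriter creation. A minimal sketch of that pattern is below; this is not Geode's actual implementation, and the method and field names (other than the constant) are hypothetical placeholders:

```java
import java.io.IOException;

public class RetrySketch {
  // Mirrors the constant added in the diff.
  protected static final int GET_INDEX_WRITER_MAX_ATTEMPTS = 10;
  // Assumed wait between attempts; the real interval is not shown in this excerpt.
  static final long RETRY_INTERVAL_MS = 5;

  // Stand-in for the real resource: simulates the fileAndChunkRegion being
  // briefly unavailable by failing twice before succeeding.
  static int failuresRemaining = 2;

  static String openIndexWriter() throws IOException {
    if (failuresRemaining-- > 0) {
      throw new IOException("fileAndChunkRegion busy");
    }
    return "writer";
  }

  // Bounded retry: loop instead of recursing (unbounded recursion risked the
  // StackOverflowError this PR fixes), and rethrow the last IOException once
  // all attempts are exhausted.
  static String getIndexWriterWithRetries() throws IOException, InterruptedException {
    IOException lastException = null;
    for (int attempt = 1; attempt <= GET_INDEX_WRITER_MAX_ATTEMPTS; attempt++) {
      try {
        return openIndexWriter();
      } catch (IOException e) {
        lastException = e;
        Thread.sleep(RETRY_INTERVAL_MS);
      }
    }
    throw lastException;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(getIndexWriterWithRetries());
  }
}
```

The key design point is that the loop converts a potentially unbounded recursive retry into a fixed number of iterations, trading completeness for a predictable failure mode (the original IOException surfaces after the cap).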
Do you think the number of retries is enough?
Based on the original ticket description, the IOException thrown is caused by "LuceneEventListener is asynchronously updating the fileAndChunkRegion". Do we know whether the wait is long enough for the update to finish? Is it a problem if IndexWriter creation needs to wait longer for the resources to be freed?
Do we know that "updating the fileAndChunkRegion" usually takes well under a minute, or did we actually hit this issue because different threads keep updating the fileAndChunkRegion? If so, we could tune the number of attempts and decide whether the waits should use different intervals.
If we do not know the answers to these questions, I think this code change is fine as a fix for the StackOverflowError.
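The reviewer's question about "different intervals" points at a common alternative to a fixed wait: exponential backoff. A hypothetical helper is sketched below; it is not part of this PR, and all names and values are illustrative:

```java
public class BackoffSketch {
  // Hypothetical backoff: double the wait on each failed attempt, capped so a
  // long retry sequence does not produce absurd sleeps. Not Geode code.
  static long backoffMs(int attempt, long baseMs, long capMs) {
    // Clamp the shift amount so the left shift cannot overflow a long.
    long delay = baseMs << Math.min(attempt, 20);
    return Math.min(delay, capMs);
  }

  public static void main(String[] args) {
    // With a 5 ms base and 1000 ms cap, the waits grow 5, 10, 20, ... up to 1000.
    for (int attempt = 0; attempt < 10; attempt++) {
      System.out.println(backoffMs(attempt, 5, 1000));
    }
  }
}
```

Backoff would let the early retries stay cheap while still covering longer outages, at the cost of a less predictable total wait than the fixed-interval loop this PR uses.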
As I understand it, the timing window needed to hit the IOException is quite small and difficult to hit, since the problem only shows up in about 1 in 1000 runs of the test I used to diagnose the issue. If the fileAndChunkRegion were unavailable for a long period of time, I would expect the issue to reproduce more often. After running some experiments, I was able to increase the number of retries to 200 without any noticeable negative effects. That stretches the window during which IOExceptions would have to be consistently encountered before an exception is thrown to about 1 second, which should help reduce the chances of hitting it. However, I don't think it's possible to know for certain how long the fileAndChunkRegion might be unavailable, since that could change based on the operation being performed on it, the size of the region, current system resources, etc.
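The author's numbers imply a per-attempt interval, which in turn gives the unavailability window each retry cap tolerates. The arithmetic below is inferred from the comment (200 attempts spanning roughly 1 second), not confirmed against the source:

```java
public class RetryWindowMath {
  public static void main(String[] args) {
    // 200 retries reportedly span about 1 second, implying roughly
    // 1000 ms / 200 attempts = 5 ms per attempt (inferred, not confirmed).
    long impliedIntervalMs = 1000 / 200;
    System.out.println(impliedIntervalMs);

    // Under that assumption, the default cap of 10 attempts tolerates about
    // 50 ms of continuous unavailability before the IOException is rethrown.
    System.out.println(10 * impliedIntervalMs);
  }
}
```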
Authored-by: Donal Evans <doevans@vmware.com>
Force-pushed from c54594d to 8afd208
Thank you for submitting a contribution to Apache Geode.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
- [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [ ] Has your PR been rebased against the latest commit within the target branch (typically develop)?
- [ ] Is your initial contribution a single, squashed commit?
- [ ] Does gradlew build run cleanly?
- [ ] Have you written or updated unit tests to verify your changes?
- [N/A] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Note:
Please ensure that once the PR is submitted, you check Concourse for build issues and
submit an update to your PR as soon as possible. If you need help, please send an
email to dev@geode.apache.org.