fix(metadata): check/prune metadata and compute JVM IDs async #1105
Conversation
@maxcao13 this is not ready for proper review just yet (see the note at the end of the PR body), but I wanted to get your eyes on this early. I think you're more familiar with the metadata transferring pieces here than I am - do you see any obvious cause for the broken archive metadata transfer? If not, I'll dive in to see what I broke and how.
I'll take a look.
There's nothing that seems to be "wrong" from what I see. Seems like the functionality should've been kept the same. Just from testing it, though, I seem to get some errors with the CredentialsManager?
But sometimes the metadata is transferred correctly and sometimes it isn't. For this issue, I would look closer at the credentials and how they're used for the jvmId, which in turn is used for archive/metadata transferring.
Hmm, okay. Thanks for checking. I hadn't seen that specific failure before, so I'll look into that at least. That stack trace looks awfully like the things I was seeing after #1083, and that I thought I had fixed in #1096, where stored credentials come back out of the DAO with nulled-out matchExpressions :-/
Yikes:
Ah, this was a bug I had fixed in my other PR. It was solved by just adding this if statement.
Not sure if the RollbackExceptions or ConcurrentModificationExceptions are related, however...
No, the rollback and concurrent modification are more likely results of #1083...
Not sure if this helps, but I've found a bug in my PR where the RecordingMetadataManager.accept oldJvmId is not actually "old" when the MetadataManager hooks into the TargetDiscoveryEvents, since for me the DiscoveryStorage has already computed the newJvmId before emitting the notification. That is what breaks my archive/metadata transferring; there may be something similar here, if you haven't already figured it out.
I think I solved the migration and transfer issues I was having before. I think this improves the startup performance a bunch, as well as making a few things non-blocking under the hood, so that the Custom Targets case I described in the linked issue is improved.

Somewhat off-topic: while working on this I had the realization that in the important deployment scenarios, e.g. k8s/OpenShift, the JMX Service URL is generally going to contain a service IP address rather than some resolvable DNS hostname. This means that the migration/transfer work we do is unlikely to be triggered to begin with, because a restarted instance may not appear to be "the same" target, as both its JMX Service URL and its JVM ID will have changed. In that kind of case I think it's OK that we don't associate the old archives with the new target instance; however, we DO need to provide the user a way to access those archives still. Just because the target has gone offline and wasn't replaced doesn't mean the user doesn't still need those archives. That's mainly a frontend problem, but we might need to tweak some of our queries to suit.

@maxcao13 @tthvo any comments on that last note ^ ? That part really needs to make it into the release but probably has to come after this PR, so the timing is a bit tight.
Sure, that makes sense. It sort of sucks that these targets might not have the same URL and might not actually be able to transfer archives/metadata, but I assume that when using k8s/OpenShift there is a way for us to figure out whether a restarted instance was actually restarted, without using the ServiceURL?
Maybe? Even if it is possible, though, I think it would make sense in that case to keep the data separated out. I have a hunch that any mechanism we implement to try to detect that case will also be very prone to confusing two different (and simultaneous) replicas of the same container with each other. Keeping the archived copies separate in this instance might also be useful for the user in order to correlate recordings to restarted instances - the JFR data in the archives might actually be helpful in determining why the instance restarted, after all, since it may have done so because the Java program within crashed.

Anyway, I think adding a query akin to the old-style

This puts us back to square one when it comes to our plan to ensure users cannot view archived recordings from other namespaces, but that's still a deeper problem that needs solving in the next release(s).
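To make the "a way to access those archives still" idea concrete, here is a hedged sketch of one possible approach: scan the archive storage directory and report subdirectories whose jvmId no longer matches any currently discovered target. The `OrphanedArchiveFinder` class and the directory layout are assumptions for illustration, not the query that would actually ship.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OrphanedArchiveFinder {
    private final Path archivesDir;

    OrphanedArchiveFinder(Path archivesDir) {
        this.archivesDir = archivesDir;
    }

    // liveJvmIds: the jvmIds of every target the platform currently reports.
    List<Path> findOrphanedArchiveDirs(Set<String> liveJvmIds) throws IOException {
        try (Stream<Path> subdirs = Files.list(archivesDir)) {
            return subdirs
                    .filter(Files::isDirectory)
                    // assumption: each subdirectory is named by the (encoded) jvmId
                    .filter(dir -> !liveJvmIds.contains(dir.getFileName().toString()))
                    .collect(Collectors.toList());
        }
    }
}
```

A frontend "All Archives"-style view could then list these directories alongside live targets, so recordings from instances that never came back are still downloadable.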
Oh, and back on-topic for this specific PR: I made a change so that the JPA/Hibernate
Okay, I will make an issue and start on that.
https://github.com/cryostatio/cryostat/actions/runs/3228695054/jobs/5285820375#step:8:2300

I'm trying to reproduce this CI failure locally with

I'll take some more time to try
Other than the itest, and some questions/comments, everything seems to work and preserves the synchronous functionality while increasing throughput, etc. Thanks for doing this!
There is a bug, probably not caused by this PR (and probably front-end related; I haven't checked main yet), but basically: if you go to the archives view with a target that has archived recordings and wait for an autoRefresh, the target and its archived recordings disappear.
EDIT: This currently happens on main as well.
Looks like this happens only in the All Archives view. I will take a look.
…ls or rules are deleted
also removes in-memory metadata map model. use only local filesystem instead. this will be slower, but keeping two copies of the model in sync is difficult. later this should be moved into the database and accessed with a DAO (a rough sketch of the filesystem-only approach follows below)
…ry database at startup time
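To illustrate the commit note above, here is a rough sketch, assuming a purely filesystem-backed store, of keeping recording label metadata as one JSON file per (jvmId, recording name) pair. The `FileSystemMetadataStore` class and its on-disk layout are illustrative, not Cryostat's actual storage scheme.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.Map;

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

class FileSystemMetadataStore {
    private final Path baseDir;
    private final Gson gson = new Gson();

    FileSystemMetadataStore(Path baseDir) {
        this.baseDir = baseDir;
    }

    // Encode the jvmId so it is always a valid directory name.
    private Path fileFor(String jvmId, String recordingName) {
        String dir = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(jvmId.getBytes(StandardCharsets.UTF_8));
        return baseDir.resolve(dir).resolve(recordingName + ".json");
    }

    void writeLabels(String jvmId, String recordingName, Map<String, String> labels)
            throws IOException {
        Path file = fileFor(jvmId, recordingName);
        Files.createDirectories(file.getParent());
        Files.writeString(file, gson.toJson(labels));
    }

    Map<String, String> readLabels(String jvmId, String recordingName) throws IOException {
        Path file = fileFor(jvmId, recordingName);
        if (!Files.exists(file)) {
            return Map.of();
        }
        return gson.fromJson(
                Files.readString(file), new TypeToken<Map<String, String>>() {}.getType());
    }
}
```

As the commit message notes, every lookup now costs file I/O instead of a map read, which is the trade-off accepted until the metadata moves into the database behind a DAO.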
Barring that one specific issue, looks good to me. Great work!
We'll have to keep an eye on that issue and anything else that comes up in the next couple of weeks. If we find anything seriously broken, there is a little time between the feature freeze and the upstream branching where we can push important bugfixes to the version branch. I would really appreciate it if you and @tthvo could try various different scenarios exercising this code, particularly in OpenShift, and see if any other problems appear.
Coming back to this, I seem to have problems with the current commit. Sometimes, a few targets are unable to be connected to when doing:

```java
logger.info("Starting archive migration");
archiveHelper.migrate(executor);
logger.info("Successfully migrated archives");
pruneStaleMetadata(staleMetadata);
logger.info("Successfully pruned all stale metadata");
platformClient.listDiscoverableServices().stream().forEach(this::handleFoundTarget);
```

And for some reason, that messes up metadata after startup is completed. I removed the
I'll see if I can reproduce it.
Messes it up in what way?
I was writing up an explanation here about doing the check/update/prune work for metadata when Cryostat restarts and the target applications around it may or may not have changed, but actually, now that I think about it and write it out, the existing mechanism for emitting target events when the discovery plugins perform updates against the database should already cover this case. The discovery database should contain information about what each builtin plugin knew at the time Cryostat went offline, and when Cryostat comes back online and restarts that builtin plugin, it will query the platform for whatever the current state is. If the set of applications reported here differs from the previous state in the database, then that should emit a target discovery event, which should in turn call

I'll go back and review my own work, the data flow, and the interaction between components here. I think removing the
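For illustration only, here is a minimal sketch, not the Cryostat code, of the event-driven pattern described above: a discovery listener that immediately hands the metadata work off to a ForkJoinPool worker so the notifying thread is never blocked. The `TargetDiscoveryEvent`/`EventKind` types and the `transferOrRestoreMetadata`/`markMetadataStale` methods are simplified assumptions.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Consumer;

class MetadataDiscoveryListener
        implements Consumer<MetadataDiscoveryListener.TargetDiscoveryEvent> {

    // Simplified stand-ins for the real discovery event types.
    enum EventKind { FOUND, LOST, MODIFIED }
    record TargetDiscoveryEvent(EventKind kind, String serviceUrl) {}

    private final Executor executor = ForkJoinPool.commonPool();

    @Override
    public void accept(TargetDiscoveryEvent event) {
        // Return immediately; the file I/O (and possible JMX connection) happens
        // on a worker thread, not on the thread that emitted the event.
        executor.execute(() -> {
            switch (event.kind()) {
                case FOUND -> transferOrRestoreMetadata(event.serviceUrl());
                case LOST -> markMetadataStale(event.serviceUrl());
                case MODIFIED -> { /* re-check the JVM ID; it may have changed */ }
            }
        });
    }

    private void transferOrRestoreMetadata(String serviceUrl) { /* hypothetical */ }

    private void markMetadataStale(String serviceUrl) { /* hypothetical */ }
}
```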
Hmm... I'm not sure what I encountered before in terms of the actual bug, and I can't seem to recreate it anymore, so it might have just been a problem on my end. I will continue to test it using OpenShift-like deployments as well. Though, I agree that the target discovery events

Aside: there are some little fixes that I think should be put into the newest release. How should that work - do I just open a pull request against that branch?
Fixes #1106
Fixes #1111
This isn't (yet) properly/fully async. The IDs are computed "async", but the returned `Future`s are often just immediately `.get()`'d. However, they are computed using a Caffeine AsyncLoadingCache very similarly to how the `TargetConnectionManager` works, and they are in fact computed using the `executeConnectedTaskAsync` from the connection manager. The metadata manager `addTargetDiscoveryListener` hook also pushes off its work onto worker threads from the `ForkJoinPool` so that the notifying thread (which could be the JDP listener thread, or an HTTP webserver thread from a Custom Target or Discovery API request) is not blocked on that task, which performs file I/O and potentially also opens a JMX connection to the target application.

Currently, archived recordings' metadata is not properly restored when Cryostat is restarted. I haven't checked if metadata for active recordings is transferred, either.
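As an illustration of the caching approach described above, here is a minimal, self-contained sketch, not the actual Cryostat implementation, of computing JVM IDs through a Caffeine AsyncLoadingCache. The `JvmIdCache` class, the `ConnectionManager` interface, and `computeJvmIdOverJmx` are hypothetical stand-ins for the real `TargetConnectionManager`/`executeConnectedTaskAsync` wiring.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;

import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
import com.github.benmanes.caffeine.cache.Caffeine;

class JvmIdCache {
    // Hypothetical collaborator: opens a JMX connection and hashes the target's
    // runtime attributes into a stable ID, off the caller's thread.
    interface ConnectionManager {
        CompletableFuture<String> computeJvmIdOverJmx(String serviceUrl, Executor executor);
    }

    private final AsyncLoadingCache<String, String> ids;

    JvmIdCache(ConnectionManager connectionManager) {
        this.ids = Caffeine.newBuilder()
                .maximumSize(1_000)
                .expireAfterAccess(Duration.ofMinutes(10))
                .executor(ForkJoinPool.commonPool())
                // asyncLoad runs on the cache's executor; callers get a CompletableFuture
                .buildAsync((serviceUrl, executor) ->
                        connectionManager.computeJvmIdOverJmx(serviceUrl, executor));
    }

    CompletableFuture<String> getJvmId(String serviceUrl) {
        return ids.get(serviceUrl);
    }
}
```

Callers that still need a synchronous answer can `.get()` the returned future, which matches the "not yet fully async" behavior described in the PR body while still deduplicating concurrent lookups for the same target.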