fix(metadata): check/prune metadata and compute JVM IDs async #1105
Conversation
@maxcao13 this is not ready for proper review just yet (see the note at the end of the PR body), but I wanted to get your eyes on this early. I think you're more familiar with the metadata transferring pieces here than I am - do you see any obvious cause for the broken archive metadata transfer? If not, I'll dive in to see what I broke and how.
I'll take a look.
There's nothing that seems to be "wrong" from what I see. Seems like the functionality should've been kept the same. Just from testing it, though, I seem to get some errors with the CredentialsManager?
But sometimes the metadata is transferred correctly and sometimes it isn't. For this issue, I would look closer at the credentials and how they're used for the jvmId, which in turn is used for archive/metadata transferring.
Hmm, okay. Thanks for checking. I hadn't seen that specific failure before, so I'll look into that at least. That stack trace looks awfully like the things I was seeing after #1083, and that I thought I had fixed in #1096, where stored credentials come back out of the DAO with nulled-out matchExpressions :-/
Yikes:
Ah, this was a bug I had fixed in my other PR. It was solved by just adding this if statement.
Not sure if the RollbackExceptions or ConcurrentModificationExceptions are related, however...
No, the rollback and concurrent modification are more likely results of #1083...
Not sure if this helps, but I've found a bug in my PR where the RecordingMetadataManager.accept oldJvmId is not actually "old" when the MetadataManager hooks into the TargetDiscoveryEvents, since for me the DiscoveryStorage has already computed the newJvmId before emitting the notification. That is what breaks my archive/metadata transferring; there may be something similar here, if you haven't already figured it out.
I think I solved the migration and transfer issues I was having before. I think this improves the startup performance a bunch, as well as making a few things non-blocking under the hood, so that the Custom Targets case I described in the linked issue is improved.

Somewhat off-topic: while working on this I had the realization that in the important deployment scenarios, e.g. k8s/OpenShift, the JMX Service URL is generally going to contain a service IP address rather than some resolvable DNS hostname. This means that the migration/transfer work we do is unlikely to be triggered to begin with, because a restarted instance may not appear to be "the same" target, as both its JMX Service URL and its JVM ID will have changed. In that kind of case I think it's OK that we don't associate the old archives with the new target instance; however, we DO need to provide the user a way to access those archives still. Just because the target has gone offline and wasn't replaced doesn't mean the user doesn't still need those archives. That's mainly a frontend problem, but we might need to tweak some of our queries to suit.

@maxcao13 @tthvo any comments on that last note ^ ? That part really needs to make it into the release but probably has to come after this PR, so the timing is a bit tight.
Sure, that makes sense. It sort of sucks that these targets might not have the same URL and might not actually be able to transfer archives/metadata, but I assume that when using k8s/OpenShift there is a way for us to figure out whether a restarted instance was actually restarted, without using the ServiceURL?
Maybe? Even if it is possible, though, I think it would make sense in that case to keep the data separated out. I have a hunch that any mechanism we implement to try to detect that case will also be very prone to confusing two different (and simultaneous) replicas of the same container with each other. Keeping the archived copies separate in this instance might also be useful for the user in order to correlate recordings to restarted instances - the JFR data in the archives might actually be helpful in determining why the instance restarted, after all, since it may have done so because the Java program within crashed.

Anyway, I think adding a query akin to the old-style

This puts us back to square one when it comes to our plan to ensure users cannot view archived recordings from other namespaces, but that's still a deeper problem that needs solving in the next release(s).
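To make the "a way to access those archives still" idea concrete, here is a hedged sketch of one possible approach: scan the archive storage directory and report subdirectories whose jvmId no longer matches any currently discovered target. The `OrphanedArchiveFinder` class and the directory layout are assumptions for illustration, not the query that would actually ship.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OrphanedArchiveFinder {
    private final Path archivesDir;

    OrphanedArchiveFinder(Path archivesDir) {
        this.archivesDir = archivesDir;
    }

    // liveJvmIds: the jvmIds of every target the platform currently reports.
    List<Path> findOrphanedArchiveDirs(Set<String> liveJvmIds) throws IOException {
        try (Stream<Path> subdirs = Files.list(archivesDir)) {
            return subdirs
                    .filter(Files::isDirectory)
                    // assumption: each subdirectory is named by the (encoded) jvmId
                    .filter(dir -> !liveJvmIds.contains(dir.getFileName().toString()))
                    .collect(Collectors.toList());
        }
    }
}
```

A frontend "All Archives"-style view could then list these directories alongside live targets, so recordings from instances that never came back are still downloadable.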
Oh, and back on-topic for this specific PR: I made a change so that the JPA/Hibernate
Okay, I will make an issue and start on that.
https://github.com/cryostatio/cryostat/actions/runs/3228695054/jobs/5285820375#step:8:2300

I'm trying to reproduce this CI failure locally with

I'll take some more time to try
Other than the itest, and some questions/comments, everything seems to work and preserves the synchronous functionality while increasing throughput, etc. Thanks for doing this!
There is a bug, probably not caused by this PR (and probably front-end related; I haven't checked main yet), but basically: if you go to the archives view with a target that has archived recordings and wait for an autoRefresh, the target and its archived recordings disappear.
EDIT: This currently happens on main as well.
Looks like this happens only in the All Archives view. I will take a look.
…ls or rules are deleted
also removes in-memory metadata map model. use only local filesystem instead. this will be slower, but keeping two copies of the model in sync is difficult. later this should be moved into the database and accessed with a DAO (a rough sketch of the filesystem-only approach follows below)
…ry database at startup time
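To illustrate the commit note above, here is a rough sketch, assuming a purely filesystem-backed store, of keeping recording label metadata as one JSON file per (jvmId, recording name) pair. The `FileSystemMetadataStore` class and its on-disk layout are illustrative, not Cryostat's actual storage scheme.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.Map;

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

class FileSystemMetadataStore {
    private final Path baseDir;
    private final Gson gson = new Gson();

    FileSystemMetadataStore(Path baseDir) {
        this.baseDir = baseDir;
    }

    // Encode the jvmId so it is always a valid directory name.
    private Path fileFor(String jvmId, String recordingName) {
        String dir = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(jvmId.getBytes(StandardCharsets.UTF_8));
        return baseDir.resolve(dir).resolve(recordingName + ".json");
    }

    void writeLabels(String jvmId, String recordingName, Map<String, String> labels)
            throws IOException {
        Path file = fileFor(jvmId, recordingName);
        Files.createDirectories(file.getParent());
        Files.writeString(file, gson.toJson(labels));
    }

    Map<String, String> readLabels(String jvmId, String recordingName) throws IOException {
        Path file = fileFor(jvmId, recordingName);
        if (!Files.exists(file)) {
            return Map.of();
        }
        return gson.fromJson(
                Files.readString(file), new TypeToken<Map<String, String>>() {}.getType());
    }
}
```

As the commit message notes, every lookup now costs file I/O instead of a map read, which is the trade-off accepted until the metadata moves into the database behind a DAO.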
Barring that one specific issue, looks good to me. Great work!
We'll have to keep an eye on that issue and anything else that comes up in the next couple of weeks. If we find anything seriously broken, there is a little time between the feature freeze and the upstream branching where we can push important bugfixes to the version branch. I would really appreciate it if you and @tthvo could try various different scenarios exercising this code, particularly in OpenShift, and see if any other problems appear.
Coming back to this, I seem to have problems with the current commit. Sometimes, a few targets are unable to be connected to when doing:

```java
logger.info("Starting archive migration");
archiveHelper.migrate(executor);
logger.info("Successfully migrated archives");
pruneStaleMetadata(staleMetadata);
logger.info("Successfully pruned all stale metadata");
platformClient.listDiscoverableServices().stream().forEach(this::handleFoundTarget);
```

And for some reason, that messes up metadata after startup is completed. I removed the
I'll see if I can reproduce it.
Messes it up in what way?
I was writing up an explanation here about doing the check/update/prune work for metadata when Cryostat restarts and the target applications around it may or may not have changed, but actually, now that I think about it and write it out, the existing mechanism for emitting target events when the discovery plugins perform updates against the database should already cover this case. The discovery database should contain information about what each builtin plugin knew at the time Cryostat went offline, and when Cryostat comes back online and restarts that builtin plugin, it will query the platform for whatever the current state is. If the set of applications reported here differs from the previous state in the database, then that should emit a target discovery event, which should in turn call

I'll go back and review my own work, the data flow, and the interaction between components here. I think removing the
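For illustration only, here is a minimal sketch, not the Cryostat code, of the event-driven pattern described above: a discovery listener that immediately hands the metadata work off to a ForkJoinPool worker so the notifying thread is never blocked. The `TargetDiscoveryEvent`/`EventKind` types and the `transferOrRestoreMetadata`/`markMetadataStale` methods are simplified assumptions.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Consumer;

class MetadataDiscoveryListener
        implements Consumer<MetadataDiscoveryListener.TargetDiscoveryEvent> {

    // Simplified stand-ins for the real discovery event types.
    enum EventKind { FOUND, LOST, MODIFIED }
    record TargetDiscoveryEvent(EventKind kind, String serviceUrl) {}

    private final Executor executor = ForkJoinPool.commonPool();

    @Override
    public void accept(TargetDiscoveryEvent event) {
        // Return immediately; the file I/O (and possible JMX connection) happens
        // on a worker thread, not on the thread that emitted the event.
        executor.execute(() -> {
            switch (event.kind()) {
                case FOUND -> transferOrRestoreMetadata(event.serviceUrl());
                case LOST -> markMetadataStale(event.serviceUrl());
                case MODIFIED -> { /* re-check the JVM ID; it may have changed */ }
            }
        });
    }

    private void transferOrRestoreMetadata(String serviceUrl) { /* hypothetical */ }

    private void markMetadataStale(String serviceUrl) { /* hypothetical */ }
}
```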
Hmm... I'm not sure what I encountered before in terms of the actual bug, and I can't seem to recreate it anymore, so it might have just been a problem on my end. I will continue to test it using OpenShift-like deployments as well. Though, I agree that the target discovery events

Aside: there are some little fixes that I think should be put into the newest release. How should that work - do I just open a pull request against that branch?
Fixes #1106
Fixes #1111
This isn't (yet) properly/fully async. The IDs are computed "async", but the returned `Future`s are often just immediately `.get()`'d. However, they are computed using a Caffeine AsyncLoadingCache very similarly to how the `TargetConnectionManager` works, and they are in fact computed using the `executeConnectedTaskAsync` from the connection manager. The metadata manager `addTargetDiscoveryListener` hook also pushes off its work onto worker threads from the `ForkJoinPool` so that the notifying thread (which could be the JDP listener thread, or an HTTP webserver thread from a Custom Target or Discovery API request) is not blocked on that task, which performs file I/O and potentially also opens a JMX connection to the target application.

Currently, archived recordings' metadata is not properly restored when Cryostat is restarted. I haven't checked if metadata for active recordings is transferred, either.
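As an illustration of the caching approach described above, here is a minimal, self-contained sketch, not the actual Cryostat implementation, of computing JVM IDs through a Caffeine AsyncLoadingCache. The `JvmIdCache` class, the `ConnectionManager` interface, and `computeJvmIdOverJmx` are hypothetical stand-ins for the real `TargetConnectionManager`/`executeConnectedTaskAsync` wiring.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;

import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
import com.github.benmanes.caffeine.cache.Caffeine;

class JvmIdCache {
    // Hypothetical collaborator: opens a JMX connection and hashes the target's
    // runtime attributes into a stable ID, off the caller's thread.
    interface ConnectionManager {
        CompletableFuture<String> computeJvmIdOverJmx(String serviceUrl, Executor executor);
    }

    private final AsyncLoadingCache<String, String> ids;

    JvmIdCache(ConnectionManager connectionManager) {
        this.ids = Caffeine.newBuilder()
                .maximumSize(1_000)
                .expireAfterAccess(Duration.ofMinutes(10))
                .executor(ForkJoinPool.commonPool())
                // asyncLoad runs on the cache's executor; callers get a CompletableFuture
                .buildAsync((serviceUrl, executor) ->
                        connectionManager.computeJvmIdOverJmx(serviceUrl, executor));
    }

    CompletableFuture<String> getJvmId(String serviceUrl) {
        return ids.get(serviceUrl);
    }
}
```

Callers that still need a synchronous answer can `.get()` the returned future, which matches the "not yet fully async" behavior described in the PR body while still deduplicating concurrent lookups for the same target.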