fix(metadata): check/prune metadata and compute JVM IDs async #1105

Merged: 42 commits, Oct 12, 2022

Conversation

andrewazores
Member

@andrewazores andrewazores commented Oct 7, 2022

Fixes #1106
Fixes #1111

  • handle metadata computation and migration asynchronously
  • compute JVM IDs asynchronously

This isn't (yet) properly/fully async. The IDs are computed "async", but the returned Futures are often just immediately .get()'d. However, they are computed using a Caffeine AsyncLoadingCache, very similarly to how the TargetConnectionManager works, and they are in fact computed using executeConnectedTaskAsync from the connection manager. The metadata manager's addTargetDiscoveryListener hook also pushes its work off onto worker threads from the ForkJoinPool, so that the notifying thread (which could be the JDP listener thread, or an HTTP webserver thread handling a Custom Target or Discovery API request) is not blocked on that task, which performs file I/O and potentially also opens a JMX connection to the target application.
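
As a rough illustration of the approach (not the PR's actual code), a JVM ID cache built on Caffeine's AsyncLoadingCache might look like the sketch below; the fetchJvmId helper standing in for executeConnectedTaskAsync is hypothetical:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executor;
    import java.util.concurrent.TimeUnit;

    import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    class JvmIdCache {
        private final AsyncLoadingCache<String, String> ids =
                Caffeine.newBuilder()
                        .expireAfterAccess(10, TimeUnit.MINUTES) // illustrative TTL, not the real setting
                        .buildAsync(this::fetchJvmId);

        // Non-blocking lookup; callers today may still immediately .get() the Future.
        CompletableFuture<String> get(String targetId) {
            return ids.get(targetId);
        }

        // Hypothetical loader: the real implementation opens a JMX connection via
        // executeConnectedTaskAsync and reads the remote JVM's identifier.
        private CompletableFuture<String> fetchJvmId(String targetId, Executor executor) {
            return CompletableFuture.supplyAsync(() -> "jvm-id-for-" + targetId, executor);
        }
    }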

Currently, archived recordings' metadata is not properly restored when Cryostat is restarted. I haven't checked if metadata for active recordings is transferred, either.

@andrewazores
Member Author

@maxcao13 this is not ready for proper review just yet (see the note at the end of the PR body), but I wanted to get your eyes on this early. I think you're more familiar with the metadata transferring code in here than I am - do you see any obvious cause for the broken archive metadata transfer? If not, I'll take a deep dive to see what/how I broke it.

@maxcao13
Member

maxcao13 commented Oct 7, 2022

I'll take a look.

@maxcao13
Member

maxcao13 commented Oct 7, 2022

There's nothing that seems to be "wrong" from what I see - it looks like the functionality should have been kept the same. Testing it, though, I get some errors from the CredentialsManager:

Exception in thread "ForkJoinPool.commonPool-worker-4" java.lang.NullPointerException
	at java.base/java.util.Objects.requireNonNull(Objects.java:208)
	at org.openjdk.nashorn.internal.runtime.Source$RawData.<init>(Source.java:170)
	at org.openjdk.nashorn.internal.runtime.Source.sourceFor(Source.java:433)
	at org.openjdk.nashorn.internal.runtime.Source.sourceFor(Source.java:444)
	at org.openjdk.nashorn.api.scripting.NashornScriptEngine.makeSource(NashornScriptEngine.java:222)
	at org.openjdk.nashorn.api.scripting.NashornScriptEngine.eval(NashornScriptEngine.java:151)
	at java.scripting/javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:231)
	at io.cryostat.rules.MatchExpressionEvaluator.applies(MatchExpressionEvaluator.java:67)
	at io.cryostat.configuration.CredentialsManager.getCredentials(CredentialsManager.java:196)
	at io.cryostat.recordings.RecordingMetadataManager.getConnectionDescriptorWithCredentials(RecordingMetadataManager.java:827)
	at io.cryostat.recordings.RecordingMetadataManager.lambda$accept$8(RecordingMetadataManager.java:405)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1395)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)

But sometimes the metadata is transferred correctly and sometimes not. For this issue, I would look closer at the credentials and how they are used to compute the jvmId, which in turn is used for archive/metadata transferring.

@andrewazores
Member Author

andrewazores commented Oct 7, 2022

Hmm, okay. Thanks for checking. I hadn't seen that specific failure before, so I'll look into that at least. That stack trace looks awfully like the ones I was seeing after #1083, which I thought I had fixed in #1096, where stored credentials come back out of the DAO with nulled-out matchExpressions :-/

@andrewazores
Member Author

Yikes:

Oct 07, 2022 3:33:09 PM io.cryostat.core.log.Logger info
INFO: Outgoing WS message: {"meta":{"category":"TargetJvmDiscovery","type":{"type":"application","subType":"json"},"serverTime":1665156789},"message":{"event":{"kind":"FOUND","serviceRef":{"connectUrl":"service:jmx:rmi:///jndi/rmi://cryostat:9096/jmxrmi","alias":"/deployments/quarkus-run.jar","labels":{},"annotations":{"platform":{},"cryostat":{"HOST":"cryostat","PORT":"9096","JAVA_MAIN":"/deployments/quarkus-run.jar","REALM":"JDP"}}}}}}
Oct 07, 2022 3:33:09 PM io.cryostat.core.log.Logger error
SEVERE: Archives subdirectory could not be renamed upon target restart
java.util.concurrent.ExecutionException: java.nio.file.NoSuchFileException: /opt/cryostat.d/recordings.d/GBLHOMT2JJ4UWVKFKUYTC6LIM5GUQNKXMRHFSXZNORXDA2SGMMYTS3SWKBLGWMTVJF3VSPI=
	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
	at io.cryostat.recordings.RecordingArchiveHelper.transferArchivesIfRestarted(RecordingArchiveHelper.java:190)
	at io.cryostat.recordings.RecordingMetadataManager.lambda$accept$8(RecordingMetadataManager.java:424)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1395)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
Caused by: java.nio.file.NoSuchFileException: /opt/cryostat.d/recordings.d/GBLHOMT2JJ4UWVKFKUYTC6LIM5GUQNKXMRHFSXZNORXDA2SGMMYTS3SWKBLGWMTVJF3VSPI=
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:440)
	at java.base/java.nio.file.Files.newDirectoryStream(Files.java:482)
	at java.base/java.nio.file.Files.list(Files.java:3792)
	at io.cryostat.core.sys.FileSystem.listDirectoryChildren(FileSystem.java:98)
	at io.cryostat.recordings.RecordingArchiveHelper.getConnectUrlFromPath(RecordingArchiveHelper.java:226)
	... 8 more

Hibernate: 
    select
        plugininfo0_.id as id1_0_0_,
        plugininfo0_.callback as callback2_0_0_,
        plugininfo0_.realm as realm3_0_0_,
        plugininfo0_.subtree as subtree4_0_0_ 
    from
        PluginInfo plugininfo0_ 
    where
        plugininfo0_.id=?
Hibernate: 
    /* load io.cryostat.discovery.PluginInfo */ select
        plugininfo0_.id as id1_0_0_,
        plugininfo0_.callback as callback2_0_0_,
        plugininfo0_.realm as realm3_0_0_,
        plugininfo0_.subtree as subtree4_0_0_ 
    from
        PluginInfo plugininfo0_ 
    where
        plugininfo0_.id=?
Hibernate: 
    /* update
        io.cryostat.discovery.PluginInfo */ update
            PluginInfo 
        set
            callback=?,
            realm=?,
            subtree=? 
        where
            id=?
Oct 07, 2022 3:33:09 PM io.cryostat.core.log.Logger info
INFO: Active recording lost auto_myrule, deleting...
Oct 07, 2022 3:33:09 PM io.cryostat.core.log.Logger info
INFO: Deleted metadata file /opt/cryostat.d/conf.d/metadata/NZ2HI6RZOZXHQSDNGYZVI6JTHFJUO2KCJUZGO3LYJF3GYWJVFVHWMNKNFV5FCUCDKFGDQPI=/MF2XI327NV4XE5LMMU======.json
Hibernate: 
    /* update
        io.cryostat.discovery.PluginInfo */ update
            PluginInfo 
        set
            callback=?,
            realm=?,
            subtree=? 
        where
            id=?
Oct 07, 2022 3:33:09 PM io.cryostat.core.log.Logger error
SEVERE: Exception thrown
javax.persistence.RollbackException: Error while committing the transaction
	at org.hibernate.internal.ExceptionConverterImpl.convertCommitException(ExceptionConverterImpl.java:81)
	at org.hibernate.engine.transaction.internal.TransactionImpl.commit(TransactionImpl.java:104)
	at io.cryostat.discovery.PluginInfoDao.update(PluginInfoDao.java:132)
	at io.cryostat.discovery.DiscoveryStorage.update(DiscoveryStorage.java:234)
	at io.cryostat.discovery.BuiltInDiscovery.lambda$start$1(BuiltInDiscovery.java:107)
	at io.cryostat.platform.AbstractPlatformClient.lambda$notifyAsyncTargetDiscovery$0(AbstractPlatformClient.java:65)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at io.cryostat.platform.AbstractPlatformClient.notifyAsyncTargetDiscovery(AbstractPlatformClient.java:65)
	at io.cryostat.platform.internal.DefaultPlatformClient.accept(DefaultPlatformClient.java:96)
	at io.cryostat.platform.internal.DefaultPlatformClient.accept(DefaultPlatformClient.java:65)
	at io.cryostat.core.net.discovery.JvmDiscoveryClient$1.lambda$onDiscovery$0(JvmDiscoveryClient.java:75)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at io.cryostat.core.net.discovery.JvmDiscoveryClient$1.onDiscovery(JvmDiscoveryClient.java:73)
	at org.openjdk.jmc.jdp.client.PacketProcessor.fireEvent(PacketProcessor.java:103)
	at org.openjdk.jmc.jdp.client.PacketProcessor.process(PacketProcessor.java:82)
	at org.openjdk.jmc.jdp.client.PacketListener.run(PacketListener.java:77)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.util.ConcurrentModificationException
	at java.base/java.util.ArrayList$Itr.checkForComodification(ArrayList.java:1013)
	at java.base/java.util.ArrayList$Itr.next(ArrayList.java:967)
	at java.base/java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1054)
	at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:602)
	at org.hibernate.engine.spi.ActionQueue.lambda$executeActions$1(ActionQueue.java:478)
	at java.base/java.util.LinkedHashMap.forEach(LinkedHashMap.java:721)
	at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:475)
	at org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:344)
	at org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:40)
	at org.hibernate.event.service.internal.EventListenerGroupImpl.fireEventOnEachListener(EventListenerGroupImpl.java:107)
	at org.hibernate.internal.SessionImpl.doFlush(SessionImpl.java:1407)
	at org.hibernate.internal.SessionImpl.managedFlush(SessionImpl.java:489)
	at org.hibernate.internal.SessionImpl.flushBeforeTransactionCompletion(SessionImpl.java:3290)
	at org.hibernate.internal.SessionImpl.beforeTransactionCompletion(SessionImpl.java:2425)
	at org.hibernate.engine.jdbc.internal.JdbcCoordinatorImpl.beforeTransactionCompletion(JdbcCoordinatorImpl.java:449)
	at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl.beforeCompletionCallback(JdbcResourceLocalTransactionCoordinatorImpl.java:183)
	at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl.access$300(JdbcResourceLocalTransactionCoordinatorImpl.java:40)
	at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.commit(JdbcResourceLocalTransactionCoordinatorImpl.java:281)
	at org.hibernate.engine.transaction.internal.TransactionImpl.commit(TransactionImpl.java:101)
	... 15 more

Oct 07, 2022 3:33:09 PM org.hibernate.engine.jdbc.spi.SqlExceptionHelper logExceptions
WARN: SQL Error: 90007, SQLState: 90007
Oct 07, 2022 3:33:09 PM org.hibernate.engine.jdbc.spi.SqlExceptionHelper logExceptions
ERROR: The object is already closed [90007-214]
Oct 07, 2022 3:33:09 PM io.cryostat.core.log.Logger error
SEVERE: Could not get credentials for targetId service:jmx:rmi:///jndi/rmi://cryostat:9093/jmxrmi, msg: org.hibernate.exception.GenericJDBCException: could not update: [io.cryostat.discovery.PluginInfo#c18b1a62-5710-4172-83a0-c5330a774393]
Oct 07, 2022 3:33:09 PM io.cryostat.core.log.Logger info
INFO: Successfully pruned all stale metadata

@maxcao13
Member

maxcao13 commented Oct 7, 2022

> Yikes:
>
> omitted...

Ah, this was a bug I have fixed in my other PR:

> Fixed a bug where a connectUrl FileNotFound exception is thrown if a target is discovered but didn't already have an archives folder created for it.

Sorry about that!

Solved by just adding this if statement:

    protected void transferArchivesIfRestarted(Path subdirectoryPath, String oldJvmId) {
        try {
            if (!fs.exists(subdirectoryPath)) {
                logger.info("No archived recordings found, skipping transfer");
                return;
            }
            // ... rest of the transfer logic unchanged ...
        }
        // ... existing catch block elided ...
    }

Not sure if the RollbackExceptions or ConcurrentModificationExceptions are related however...

@andrewazores
Member Author

No, the rollback and concurrent modification are more likely results of #1083...

@maxcao13
Member

maxcao13 commented Oct 7, 2022

Not sure if this helps, but I've found a bug in my PR where the oldJvmId in RecordingMetadataManager.accept is not actually "old" when the MetadataManager hooks into the TargetDiscoveryEvents, since for me the DiscoveryStorage has already computed the newJvmId before emitting the notification. That is what breaks my archive/metadata transferring; there may be something similar here if you haven't already figured it out.
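
Just to illustrate the ordering pitfall I mean (a minimal sketch with hypothetical names, not code from either PR): the previously-known ID has to be read before the new one is stored, otherwise the restart is never detected.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class JvmIdTracker {
        private final Map<String, String> jvmIdsByTarget = new ConcurrentHashMap<>();

        void onTargetFound(String targetId, String newJvmId) {
            // put() returns the previous mapping, so the "old" ID is captured
            // before it is overwritten with the freshly computed one.
            String oldJvmId = jvmIdsByTarget.put(targetId, newJvmId);
            if (oldJvmId != null && !oldJvmId.equals(newJvmId)) {
                // restart detected: re-key archives and metadata (hypothetical helper)
                transferArchivesAndMetadata(oldJvmId, newJvmId);
            }
        }

        private void transferArchivesAndMetadata(String oldJvmId, String newJvmId) {
            // would rename the archives subdirectory and update metadata files
        }
    }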

@andrewazores
Member Author

I think I solved the migration and transfer issues I was having before. I think this improves startup performance quite a bit, as well as making a few things non-blocking under the hood, so the Custom Targets case I described in the linked issue is improved.


Somewhat off-topic: while working on this I had the realization that in the important deployment scenarios, e.g. k8s/OpenShift, the JMX Service URL is generally going to contain a service IP address rather than some resolvable DNS hostname. This means that the migration/transfer work we do is unlikely to be triggered in the first place, because a restarted instance may not appear to be "the same" target: both its JMX Service URL and its JVM ID will have changed. In that kind of case I think it's OK that we don't associate the old archives with the new target instance; however, we DO still need to provide the user a way to access those archives. Just because the target has gone offline and wasn't replaced doesn't mean the user doesn't still need those archives. That's mainly a frontend problem, but we might need to tweak some of our queries to suit.

@maxcao13 @tthvo any comments on that last note ^ ? That part really needs to make it into the release but probably has to come after this PR, so the timing is a bit tight.

@maxcao13
Member

maxcao13 commented Oct 7, 2022

> Somewhat off-topic: while working on this I had the realization that in the important deployment scenarios, e.g. k8s/OpenShift, the JMX Service URL is generally going to contain a service IP address rather than some resolvable DNS hostname. […]

Sure, that makes sense. It sort of sucks that these targets might not have the same URL and might not actually be able to have their archives/metadata transferred, but I assume that when using k8s/OpenShift there is a way for us to figure out whether an instance was actually restarted without using the ServiceURL?

@andrewazores
Member Author

> Somewhat off-topic: while working on this I had the realization that in the important deployment scenarios, e.g. k8s/OpenShift, the JMX Service URL is generally going to contain a service IP address rather than some resolvable DNS hostname. […]
>
> Sure, that makes sense. […] I assume that when using k8s/OpenShift there is a way for us to figure out whether an instance was actually restarted without using the ServiceURL?

Maybe? Even if it is possible, though, I think it would make sense in that case to keep the data separated out. I have a hunch that any mechanism we implement to try to detect that case will also be very prone to confusing two different (and simultaneous) replicas of the same container with each other. Keeping the archived copies separate in this instance might also be useful for letting the user correlate recordings with restarted instances - the JFR data in the archives might actually help determine why the instance restarted, after all, since it may have done so because the Java program within crashed.

Anyway, I think adding a query akin to the old-style GET /api/v1/recordings is what we need to ensure the user can find those recordings from old targets that we don't see any replacement for. IMO this should probably be the new underlying query for the frontend's All-Archives view. Rather than getting the list of known targets and then grabbing the corresponding archived recording information for those targets only, it needs to be essentially just a view of the archives and the subdirectories, with no consideration for which of those correspond to currently-known targets.

This puts us back to square one when it comes to our plan to ensure users cannot view archived recordings from other namespaces, but that's still a deeper problem that needs solving in the next release(s).
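
As a rough sketch of that "just a view of the archives" idea (assuming a flat archives-root/subdirectory/recording.jfr layout; the names are illustrative, not the actual handler code):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    class AllArchivesLister {
        // Lists every archived recording on disk, with no filtering against
        // the set of currently-known targets.
        List<Path> listAllArchivedRecordings(Path archivesRoot) throws IOException {
            List<Path> recordings = new ArrayList<>();
            try (Stream<Path> subdirs = Files.list(archivesRoot)) {
                for (Path dir : subdirs.filter(Files::isDirectory).collect(Collectors.toList())) {
                    try (Stream<Path> files = Files.list(dir)) {
                        files.filter(f -> f.getFileName().toString().endsWith(".jfr"))
                                .forEach(recordings::add);
                    }
                }
            }
            return recordings;
        }
    }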

@andrewazores
Member Author

Oh, and back on-topic for this specific PR: I made a change so that the JPA/Hibernate EntityManager is a singleton and accesses to it through the DAOs are synchronized on the entity manager instance. This is definitely not the ideal way to use these things, and it definitely leads to lower throughput and probably more caching in the Hibernate layer than necessary, but I haven't run into any data consistency issues (nulled-out matchExpressions on stored credentials, concurrent modification exceptions on discovery updates, etc.) since.
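
A minimal sketch of that "single EntityManager, synchronized DAO access" shape (illustrative only, not the actual DAO code):

    import javax.persistence.EntityManager;
    import javax.persistence.EntityTransaction;

    class SynchronizedDao<T> {
        private final EntityManager em; // application-wide singleton instance

        SynchronizedDao(EntityManager em) {
            this.em = em;
        }

        T save(T entity) {
            // All DAO operations take the same lock, so concurrent discovery/credentials
            // updates can no longer interleave inside one Hibernate session.
            synchronized (em) {
                EntityTransaction tx = em.getTransaction();
                tx.begin();
                try {
                    T merged = em.merge(entity);
                    tx.commit();
                    return merged;
                } catch (RuntimeException e) {
                    tx.rollback();
                    throw e;
                }
            }
        }
    }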

@maxcao13
Member

> Anyway, I think adding a query akin to the old-style GET /api/v1/recordings is what we need to ensure the user can find those recordings from old targets that we don't see any replacement for. […]

Okay, I will make an issue and start on that.

@andrewazores
Member Author

https://github.com/cryostatio/cryostat/actions/runs/3228695054/jobs/5285820375#step:8:2300

I'm trying to reproduce this CI failure locally with bash repeated-integration-tests.bash 5 InterleavedExternalTargetRequestsIT but it seems to run cleanly and succeed every time. I suspect this is actually occurring due to a cleanup failure elsewhere, perhaps in the AutoRulesIT.

I'll take some more time to try bash repeated-integration-tests.bash 5 AutoRulesIT,InterleavedExternalTargetRequestsIT etc. to see if I can narrow down the specific problem.

Member

@maxcao13 maxcao13 left a comment


Other than the itest failure and some questions/comments, everything seems to work and preserve the synchronous functionality while increasing throughput. Thanks for doing this!

There is a bug, probably not caused by this PR (and probably front-end related; I haven't checked main yet): if you go to the Archives view with a target that has archived recordings and wait for an autoRefresh, the target and its archived recordings disappear.

EDIT: This currently happens on main too.

@tthvo
Member

tthvo commented Oct 11, 2022

> There is a bug, probably not caused by this PR (and probably front-end related; I haven't checked main yet): if you go to the Archives view with a target that has archived recordings and wait for an auto refresh (with that setting enabled, via the refreshRecordingList React.useEffect()), the target and its archived recordings disappear.

Looks like this happens only in All Archives. I will take a look.

@andrewazores
Member Author

> https://github.com/cryostatio/cryostat/actions/runs/3228695054/jobs/5285820375#step:8:2300
>
> I'm trying to reproduce this CI failure locally with bash repeated-integration-tests.bash 5 InterleavedExternalTargetRequestsIT but it seems to run cleanly and succeed every time. […]

bash repeated-integration-tests.bash 10 InterleavedExternalTargetRequestsIT worked for me with no failures locally. bash repeated-integration-tests.bash 10 AutoRulesIT,InterleavedExternalTargetRequestsIT has lots of failures where there are more recordings on the targets than expected. I'm a little distracted so I haven't checked thoroughly, but it appears that if AutoRulesIT runs first, then InterleavedExternalTargetRequestsIT fails when run afterward. Seems like an AutoRulesIT cleanup problem for sure.

Member

@maxcao13 maxcao13 left a comment


Barring that one specific issue, looks good to me. Great work!

@andrewazores
Member Author

We'll have to keep an eye on that issue and anything else that comes up in the next couple of weeks. If we find anything seriously broken, there is a little time after the feature freeze and the upstream branching where we can push important bugfixes to the version branch. I would really appreciate it if you and @tthvo could try various scenarios exercising this code, particularly in OpenShift, and see if any other problems appear.

@andrewazores andrewazores merged commit 801779d into cryostatio:main Oct 12, 2022
@andrewazores andrewazores deleted the async-metadata branch October 12, 2022 21:29
@maxcao13
Member

maxcao13 commented Oct 14, 2022

I think everything looks good and I don't see any bugs from testing.
One thing I'm not sure about (it might be a contrived example): if you use smoketest, create a custom target with URL localhost:0, archive a recording using that custom target, and then restart Cryostat, the archived recording won't show up later under either service//:...cryostat:9091 or localhost:0.
But this seems to only happen with localhost:0 or other localhost-prefixed URLs; using a URL like cryostat:9093 is okay.

Still working on this. I think this is actually a problem in general with any targets that appear in Discovery Storage immediately at startup rather than being discovered later on, so it's a pretty significant thing that needs to work.

Coming back to this, I seem to have problems with the current commit.

Sometimes a few targets cannot be connected to while doing:

logger.info("Starting archive migration");
archiveHelper.migrate(executor);
logger.info("Successfully migrated archives");
pruneStaleMetadata(staleMetadata);
logger.info("Successfully pruned all stale metadata");
platformClient.listDiscoverableServices().stream().forEach(this::handleFoundTarget);

And for some reason, that messes up metadata after startup is completed.

I removed the handleFoundTarget call on each discovered service and things seemed to go back to normal. I realize I didn't understand why this was needed; could you explain it again? Sorry about that.

@andrewazores
Member Author

I'll see if I can reproduce it.

> And for some reason, that messes up metadata after startup is completed.

Messes it up in what way?

> I realize I didn't understand why this was needed; could you explain it again? Sorry about that.

I was writing up an explanation here about doing the check/update/prune work for metadata when Cryostat restarts and the target applications around it may or may not have changed, but now that I think about it and write it out, I think the existing mechanism for emitting target events when the discovery plugins perform updates against the database should already cover this case.

The discovery database should contain information about what each builtin plugin knew at the time Cryostat went offline, and when Cryostat comes back online and restarts that builtin plugin, it will query the platform for whatever the current state is. If the set of applications reported here differs from the previous state in the database then that should emit a target discovery event, which should in turn call accept(TargetDiscoveryEvent) on the metadata manager, which effectively queues the check/update/prune work for that target for processing by a worker thread.
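
In sketch form (hypothetical types and helpers, not the actual RecordingMetadataManager), that flow looks roughly like this: the discovery event lands on accept(), which only enqueues the work onto a worker thread so the notifying thread returns immediately:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.ForkJoinPool;
    import java.util.function.Consumer;

    class MetadataDiscoveryListener implements Consumer<MetadataDiscoveryListener.TargetDiscoveryEvent> {

        enum EventKind { FOUND, LOST, MODIFIED }

        record TargetDiscoveryEvent(EventKind kind, String targetId) {}

        private final ExecutorService workers = ForkJoinPool.commonPool();

        @Override
        public void accept(TargetDiscoveryEvent event) {
            // Queue the check/update/prune work; the notifying thread (JDP listener,
            // webserver, etc.) does not perform file I/O or open JMX connections here.
            workers.submit(() -> {
                switch (event.kind()) {
                    case FOUND -> checkAndUpdateMetadata(event.targetId());
                    case LOST -> pruneMetadata(event.targetId());
                    case MODIFIED -> checkAndUpdateMetadata(event.targetId());
                }
            });
        }

        private void checkAndUpdateMetadata(String targetId) {
            // hypothetical: would compute the JVM ID and migrate/re-key metadata files
        }

        private void pruneMetadata(String targetId) {
            // hypothetical: would delete metadata for recordings that no longer exist
        }
    }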

I'll go back and review my own work, the data flow, and the interaction between components here. I think removing the handleFoundTarget at startup as you've tried might be OK.

@maxcao13
Member

maxcao13 commented Oct 17, 2022

> I'll see if I can reproduce it. […]
>
> I'll go back and review my own work, the data flow, and the interaction between components here. I think removing the handleFoundTarget at startup as you've tried might be OK.

Hmm... I'm not sure what I encountered before in terms of the actual bug, and I can't seem to recreate it anymore, so it might have just been a problem on my end. I will continue to test it, including in OpenShift-like deployments.

Though I agree that the target discovery event accept hook in the metadataManager should probably cover the case you are talking about: if the target was updated, then the event would be emitted to cover it, and if it wasn't, then we wouldn't need to handle the target anyway.

As an aside, there are some small fixes that I think should go into the newest release. How should that work? Do I just open a pull request against that branch?

Successfully merging this pull request may close these issues:

  • [Task] Rule deletion broken
  • [Task] Metadata and JVM ID computation can block threads