
[KCI-612] Clean up a terminated query's state stores #7729

Merged · 9 commits · Jul 20, 2021

Conversation

Contributor

@lct45 lct45 commented Jun 25, 2021

Description

In a recent disk-out-of-space incident, a few terminated queries had state stores that didn't get cleaned up. This took up a lot of disk space and ultimately had to be cleaned up manually. It seems that because the query was terminated and the pod was rolled shortly after, the query wasn't restarted, but the stores didn't get cleaned up before we rolled the pod.

Now we'll check state stores when we initialize the app and remove any that don't match a currently running query.

Testing done

Unit + integration tests included

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@lct45 lct45 requested a review from a team as a code owner June 25, 2021 16:10
@lct45 lct45 requested review from guozhangwang and swist June 25, 2021 16:10
Contributor

@guozhangwang guozhangwang left a comment

I've made a quick pass on the non-testing code.

f -> {
    if (stateStoreNames.stream().noneMatch((name) -> f.getName().endsWith(name))) {
        try {
            Files.walk(f.toPath())
Contributor

nit: I personally like Files.walkFileTree since, with the visitor pattern, it allows us to print a more detailed error message, e.g. which file the delete hit the error on.

An example of its usage: https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/utils/Utils.java#L812
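A minimal sketch of the visitor-pattern deletion being suggested, modeled loosely on the linked Kafka utility; the class and method names here are illustrative, not ksqlDB's actual code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class StateDirDeleter {

    // Recursively deletes `root`. The visitor pattern lets the error message
    // name the exact file or directory on which deletion failed.
    public static void deleteRecursively(final Path root) {
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(final Path file, final BasicFileAttributes attrs)
                        throws IOException {
                    try {
                        Files.delete(file);
                    } catch (final IOException e) {
                        // Wrap with the offending path for a detailed log line.
                        throw new IOException("Failed to delete file: " + file, e);
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(final Path dir, final IOException exc)
                        throws IOException {
                    if (exc != null) {
                        throw exc; // propagate the detailed failure from below
                    }
                    Files.delete(dir); // children are gone, so the directory is empty
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```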

Contributor

Plus, since we would not expect to find any existing state store dirs that do not match existing persistent stores, we'd also log a warning if we do successfully delete something here, e.g. "WARN: Deleted local state store for non-existing query {}; this is not expected and is probably due to a race condition when query {} was dropped before".

@@ -466,6 +509,8 @@ private void initialize(final KsqlConfig configWithApplicationServer) {
ksqlConfigNoPort
);
commandRunner.processPriorCommands();
cleanupOldStateDirectories(configWithApplicationServer);
Contributor

If we do have a large number of dangling state store dirs here, there's a small risk that restoring existing queries -- i.e. starting their execution -- may already have failed. Hence I'm wondering if we can refactor the timing of triggering this function a bit, i.e. run it after we've consumed all the commands from the cmd topic, but before we start executing the restoration of queries?

Contributor

Yeah that seems a better ordering to me.

Contributor

agavra commented Jun 28, 2021

does this fix #7720? cc @AlanConfluent

@guozhangwang
Contributor

Also it makes me think: since we only try to do the cleanup upon (re-)starting the app, we would still accumulate leaked state stores as we continue to run until we exhaust 500GB --- in the future, it may be 125GB --- in which case we would error out and try to replace the thread only, and hence would still fall into the loop. I.e. unless the pod is bounced, either due to upgrade/scaling or proactively by ourselves, we would still have this issue. Is that right?

public void cleanupOldStateDirectories(final KsqlConfig configWithApplicationServer) {
    final String stateDir = configWithApplicationServer.getKsqlStreamConfigProps().getOrDefault(
        StreamsConfig.STATE_DIR_CONFIG,
        StreamsConfig.configDef().defaultValues().get(StreamsConfig.STATE_DIR_CONFIG)).toString();
Contributor Author

This default is a little confusing to me... We confirm that the state directory is available here, in the ksql server main, but we don't set the config if we end up using the default. This means that if we're using the default state directory, we can't use the config to access the store name anywhere...

I assume that if we don't set the config there (when it's not already set, ofc), it's so users can set the config later..? But it does make it a little tricky to remember to use the default instead of the config. cc @swist

Contributor

Hmm.. I agree this is indeed a bit weird, maybe overlooked rather than intentional?

        .map(PersistentQueryMetadata::getQueryApplicationId)
        .collect(Collectors.toSet());
try {
    Files.walkFileTree(Paths.get(stateDir), new SimpleFileVisitor<Path>() {
Contributor Author

@guozhangwang updated this, lmk if it looks good to you

Contributor

Left a few more comments.


    StreamsConfig.STATE_DIR_CONFIG,
    StreamsConfig.configDef().defaultValues().get(StreamsConfig.STATE_DIR_CONFIG)).toString();

final Set<String> stateStoreNames =
Contributor

I think it is not stateStoreNames, but queryNames?

Contributor

Is this a valid comment? Please correct me if I'm wrong.

Contributor Author

It should be both, right? I guess semantically we can call it queryNames, although it's really queryApplicationIds, since those have the formatting that matches the state store names. But this list should match the directories within the top-level state store folder (aside from queries that didn't get cleaned up).

try {
    Files.delete(path);
    log.warn("Deleted local state store for non-existing query {}. This is not expected and was likely due to a " +
        "race condition when the query was dropped before.", path.getFileName());
Contributor

I think we need to be forward-looking a bit since as we binpack queries into KS, the local state dir names may not be the query ids but named topology ids (cc @ableegoldman @wcarlson5 could you clarify)?



@Override
public FileVisitResult visitFile(Path path, BasicFileAttributes attrs) {
    if (!stateStoreNames.contains(path.getFileName())) {
Contributor

Hmm.. are we deleting at the per-query granularity (i.e. if we check the top-level query is not in the persistent query metadata, we would just delete all its sub-folders/files) or per-state-store granularity here? I think the stateStoreNames here is really storing query ids right?

Member

Renamed to queryApplicationIds


@vvcephei
Member

vvcephei commented Jul 1, 2021

Thanks for this, @lct45 . Just checking: does this resolve #7720 ?


try {
    Files.delete(path);

    log.warn(
Contributor

Is path.getFileName() gonna be query names? Or would it be possible to be at the state store or even deeper level? Ditto below for postVisitDirectory.

Contributor Author

I don't know if this file-visiting system works like the first one did, but everything within the first level of the state store should match query names. Any directory at that level that doesn't match a current query name should be deleted.

So if we hit visitFile for the first layer within the state store, it should match the query application id, which is the list we've pulled above. Anything deeper I don't know -> makes me think we need to rethink this algo a bit. We want to recursively delete those first-layer directories without checking whether the files inside match.

Contributor Author

lct45 commented Jul 1, 2021

I think it would resolve #7720 @vvcephei @agavra if I'm reading that correctly. It sounds like the issue @AlanConfluent reported is the same one @guozhangwang reported in #7729

Contributor

@rodesai rodesai left a comment

Sorry for the late feedback 😬 - this kept flying under my radar. I think we can simplify the logic here, and reuse some of the existing interfaces for cleaning up leaked state that @AlanConfluent implemented in #6714

At a high level, this patch needs to do the following:

  1. build up a list of IDs for queries that were terminated which we may have missed cleaning up.
  2. clean up the resources associated with the list of IDs we built up in (1)

For (1) I think we can use the local state dirs like you've done here. However we just need to list the contents of the state dir and compare it to the IDs in ksqlEngine.getPersistentQueries() - we don't need to walk the whole tree.

To do the actual clean up I think we should reuse QueryCleanupService.QueryCleanupTask. This means we add the state directory cleanup to that class's run method. This will also have us cleanup leaked stores for transient queries.
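Step (1) of the plan above can be sketched as a plain directory listing diffed against the running queries' application ids, with no tree walk; the class and method names here are illustrative, not ksqlDB's actual API:

```java
import java.io.File;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class LeakedStateDirs {

    // Lists the state dir's immediate children and keeps those whose names do
    // not match any running query's application id. Each survivor is a
    // candidate for cleanup (e.g. by a QueryCleanupTask).
    public static List<String> find(final File stateDir, final Set<String> runningAppIds) {
        final File[] children = stateDir.listFiles(File::isDirectory);
        if (children == null) {
            return Collections.emptyList(); // state dir missing or unreadable
        }
        return Arrays.stream(children)
            .map(File::getName)
            .filter(name -> !runningAppIds.contains(name))
            .sorted()
            .collect(Collectors.toList());
    }
}
```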

Contributor Author

lct45 commented Jul 8, 2021

@rodesai made some of the changes you suggested but I'm not totally sure I got everything you were suggesting so if you could glance over it that would be great. @guozhangwang I moved the obsolete state directory list into the command runner but had to pass in new parameters to get everything to work. LMK if this looks good and I'll update the tests that I broke

Contributor

@rodesai rodesai left a comment

This is looking good. Just a couple minor points inline.

@@ -106,6 +114,16 @@ public String getAppId() {

@Override
public void run() {
    try {
        FileUtils.deleteDirectory(new File(stateDir + appId));
Contributor

is there any way to get this path out of streams instead of computing it here?

Contributor Author

I don't think so, and I'm not sure it would make sense. We get the list of directories to cleanup as strings from the top-level state directory. To get the corresponding path, if it exists, we'd have to take the string of the directory, map it back to the metadata or streams instance, and then get the path name.

We could compute the path name in the command runner and pass in the full path here as appId, but that would make the next few checks somewhat messy since we wouldn't really have an appId in that case. That would also still require the new File. WDYT @rodesai ?

Contributor

> We could compute the path name in the command runner and pass in the full path here as appId, but that would make the next few checks somewhat messy since we wouldn't really have an appId in that case. That would also still require the new File. WDYT @rodesai ?

Yeah I agree what's already implemented is preferable. No worries I think it's ok as is - it would be an incompatible change for streams to change the way the state dir is computed anyway.

@@ -284,6 +294,19 @@ public void processPriorCommands() {
        .getKsqlEngine()
        .getPersistentQueries();

final Set<String> stateStoreNames =
Contributor

nit: this feels a little out of place at the top level of this method, and it also leaks some internal details about the underlying stream processing runtime into this class. Could we break it into its own interface, like this, that we pass into this class and call here?

interface PersistentQueryCleanup {
    void cleanupLeakedQueries(List<PersistentQueryMetadata> activeQueryIds);
}

Contributor Author

This interface would get initialized from ksqlServerRestApplication, right? So we'd get the stateDir and serviceContext from there? What's the benefit of doing this as an interface instead of a class, and if we do an interface is the idea that it's implemented in ksqlServerRestApplication or commandRunner?

Contributor

We'd implement the interface from KsqlServerRestApplication, which is the benefit. Ideally we should avoid writing code in this class that assumes the underlying runtime is KafkaStreams.

Contributor Author

I'm not sure I follow - what's the benefit of doing PersistentQueryCleanup as an interface in KsqlServerRestApplication instead of a stand-alone class?

Contributor

I think it's a good goal to try to keep the kafka-streams specific code abstracted away from the rest of the server/engine. It keeps things decoupled which makes the code easier to understand and reason about. Putting an interface between the command runner and the streams code that does clean up makes it easier to potentially try out different runtimes in the future. We'll probably never do this (might be a fun hack day project 😄) but it's a good principle to try to structure the code this way.

Contributor Author

I made a pass at doing an interface - I can't tell if it's what you want. I get that we want to keep streams code abstracted, but it's odd to me that we'd implement the PersistentQueryCleanup inside of the rest app if we're trying to keep it separate. It makes more sense to me to create a standalone class for PersistentQueryCleanup that the rest app instantiates and then passes to the command runner, which then calls a method on the standalone class, rather than doing a callback to a method inside of the rest app (which is what called the command runner in the first place).

The other option for the interface would be to have KsqlRestApplication as a whole implement PersistentQueryCleanup, but this seemed like it would make testing harder since there wouldn't be a simple constructor for a PersistentQueryCleanup object. LMK your thoughts

Contributor

> but it's odd to me that we'd want to implement the PersistentQueryCleanup inside of the rest app if we're trying to keep it separate, it makes more sense to me to create a standalone class for PersistentQueryCleanup that the rest app instantiates and then passes to the command runner

Either way is fine by me - putting it in a standalone class is probably cleaner. It would be good to have an interface the standalone class implements to keep the implementation completely separate.
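The shape being discussed — a small interface, a standalone implementation instantiated by the rest app, and a command runner that only sees the interface — might look roughly like this; all names are illustrative and the cleanup body is stubbed:

```java
import java.util.List;

public class CleanupWiring {

    // The command runner depends only on this interface, so it stays unaware
    // of the Kafka Streams-specific details of how state is cleaned up.
    public interface QueryCleanup {
        void cleanupLeakedQueries(List<String> activeQueryIds);
    }

    // Standalone implementation; the rest app would construct it (e.g. with
    // the state dir and service context) and hand it to the command runner.
    public static class PersistentQueryCleanupImpl implements QueryCleanup {
        private final String stateDir;

        public PersistentQueryCleanupImpl(final String stateDir) {
            this.stateDir = stateDir;
        }

        @Override
        public void cleanupLeakedQueries(final List<String> activeQueryIds) {
            // Stub: here we would list stateDir, diff against activeQueryIds,
            // and enqueue cleanup tasks for anything left over.
        }
    }

    // Command runner sketch: it calls through the interface only.
    public static class CommandRunner {
        private final QueryCleanup cleanup;

        public CommandRunner(final QueryCleanup cleanup) {
            this.cleanup = cleanup;
        }

        public void processPriorCommands(final List<String> activeQueryIds) {
            cleanup.cleanupLeakedQueries(activeQueryIds);
        }
    }
}
```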

        appId);
} catch (IOException e) {
    LOG.error("Error cleaning up state directory {}\n. {}", appId, e);
}
Contributor Author

After running through testing I actually think this is kind of messy. We don't really want to try anything else in this run() call, right? With the tests, it spits out a bunch of errors because it can't clean up the schema registry etc. Repurposing this feels a little weird unless we really do want those checks for any leftover persistent queries. Thoughts @rodesai ?

Contributor

Not sure I follow what the problem is. Which checks are you referring to?

Contributor Author

Below this we check if we can cleanup internal topic schemas and delete internal topics. This is what the logs end up looking like:
[2021-07-15 08:08:07,100] WARN Could not clean up the schema registry for query: fakeStateStore (io.confluent.ksql.schema.registry.SchemaRegistryUtil:72)
java.lang.NullPointerException
at io.confluent.ksql.schema.registry.SchemaRegistryUtil.getSubjectNames(SchemaRegistryUtil.java:70)
at io.confluent.ksql.schema.registry.SchemaRegistryUtil.getInternalSubjectNames(SchemaRegistryUtil.java:194)
at io.confluent.ksql.schema.registry.SchemaRegistryUtil.cleanupInternalTopicSchemas(SchemaRegistryUtil.java:53)
at io.confluent.ksql.engine.QueryCleanupService$QueryCleanupTask.lambda$run$0(QueryCleanupService.java:127)
at io.confluent.ksql.engine.QueryCleanupService$QueryCleanupTask.tryRun(QueryCleanupService.java:144)
at io.confluent.ksql.engine.QueryCleanupService$QueryCleanupTask.run(QueryCleanupService.java:126)
at io.confluent.ksql.engine.QueryCleanupService.run(QueryCleanupService.java:64)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:117)
at java.lang.Thread.run(Thread.java:748)
[2021-07-15 08:08:07,105] WARN Failed to cleanup internal topics for fakeStateStore (io.confluent.ksql.engine.QueryCleanupService:146)
java.lang.NullPointerException
at io.confluent.ksql.engine.QueryCleanupService$QueryCleanupTask.lambda$run$1(QueryCleanupService.java:134)
at io.confluent.ksql.engine.QueryCleanupService$QueryCleanupTask.tryRun(QueryCleanupService.java:144)
at io.confluent.ksql.engine.QueryCleanupService$QueryCleanupTask.run(QueryCleanupService.java:134)
at io.confluent.ksql.engine.QueryCleanupService.run(QueryCleanupService.java:64)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:117)
at java.lang.Thread.run(Thread.java:748)
[2021-07-15 08:08:07,108] WARN Failed to cleanup internal consumer groups for fakeStateStore (io.confluent.ksql.engine.QueryCleanupService:146)

Contributor

The cleanup task should do its best to clean up everything that could possibly be left behind locally and in kafka/sr for the query. Are you worried about the case where the schema/group/topic doesn't exist? The cleanup code should be able to handle the case where something it's supposed to clean up doesn't exist by just logging and continuing. If the resource doesn't exist anymore there's nothing to clean up - so the state is what we want it to be.

In this case I'd guess that the test setup has some problem which causes us to throw an NPE where we would never actually throw an NPE when actually running ksql. Do we know what value is getting set to null?

Contributor

In this case I think it's the context you pass to PersistentQueryCleanup from PersistentQueryCleanupTest - you'd need to setup that context to return a mock schema registry client and admin client.

Contributor Author

Ahhhhhhh okay that makes more sense, this cleanup order should be good then

Contributor

@rodesai rodesai left a comment

LGTM!

@lct45 lct45 merged commit eddac72 into master Jul 20, 2021
@guozhangwang
Contributor

Sorry for being late on the final pass of this. I also like the new approach better -- doing the cleanup in a background task rather than tying up the app's startup thread. LGTM!
