Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeoIpDownloader race conditions when starting / stopping #92888

Open
masseyke opened this issue Jan 12, 2023 · 5 comments
Open

GeoIpDownloader race conditions when starting / stopping #92888

masseyke opened this issue Jan 12, 2023 · 5 comments
Assignees
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@masseyke
Copy link
Member

Elasticsearch Version

8.7.0

Installed Plugins

No response

Java Version

bundled

OS Version

n/a

Problem Description

The geoip downloader has race conditions that can cause exceptions and unexpected behavior if it is stopped and started quickly. There are a few reasons for this, and they're not really easy to fix:

  1. When we stop the geoip downloader task, we delete the .geoip_databases index asynchronously (added in Fix disabling GeoIP downloader through elasticsearch.yml #76924). If you re-start the downloader immediately, the delete can still be running when the start task is running, leading to an exception and missing a missing geoip_databases index.
  2. The download is run asynchronously. So we can still be indexing data after the task has been stopped and the index has been deleted.
  3. The things that trigger the downloader to start or stop (such as updating settings or creating/deleting geoip procesors) run asynchronously. It is possible that a stop task begins before the task it is trying to stop was started.
  4. The task state is updated asynchronously. We can wind up with the wrong task state being stored in the cluster state.
  5. Each ingest node has its own local copies of the database, and they are created / deleted asynchronously.

Here is a stack trace for problem # 1:

  1> org.elasticsearch.index.IndexNotFoundException: no such index [.geoip_databases]
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.notFoundException(IndexNameExpressionResolver.java:456) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.ensureAliasOrIndexExists(IndexNameExpressionResolver.java:469) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.innerResolve(IndexNameExpressionResolver.java:1224) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.resolve(IndexNameExpressionResolver.java:1130) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.resolveExpressions(IndexNameExpressionResolver.java:247) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:323) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:313) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:86) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.replication.TransportBroadcastReplicationAction.shards(TransportBroadcastReplicationAction.java:131) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.replication.TransportBroadcastReplicationAction.doExecute(TransportBroadcastReplicationAction.java:73) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.replication.TransportBroadcastReplicationAction.doExecute(TransportBroadcastReplicationAction.java:42) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:86) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:61) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.tasks.TaskManager.registerAndExecute(TaskManager.java:202) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.client.internal.node.NodeClient.executeLocally(NodeClient.java:112) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.client.internal.node.NodeClient.doExecute(NodeClient.java:90) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.client.internal.support.AbstractClient.execute(AbstractClient.java:380) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.client.internal.FilterClient.doExecute(FilterClient.java:57) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.client.internal.OriginSettingClient.doExecute(OriginSettingClient.java:43) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.client.internal.support.AbstractClient.execute(AbstractClient.java:380) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.client.internal.support.AbstractClient.execute(AbstractClient.java:366) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.client.internal.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1262) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.client.internal.support.AbstractClient$IndicesAdmin.refresh(AbstractClient.java:1486) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.ingest.geoip.GeoIpDownloader.indexChunks(GeoIpDownloader.java:289) ~[main/:?]
  1> 	at org.elasticsearch.ingest.geoip.GeoIpDownloader.processDatabase(GeoIpDownloader.java:216) ~[main/:?]
  1> 	at org.elasticsearch.ingest.geoip.GeoIpDownloader.updateDatabases(GeoIpDownloader.java:178) ~[main/:?]
  1> 	at org.elasticsearch.ingest.geoip.GeoIpDownloader.runDownloader(GeoIpDownloader.java:328) ~[main/:?]
  1> 	at org.elasticsearch.ingest.geoip.GeoIpDownloader.runDownloader(GeoIpDownloader.java:320) ~[main/:?]
  1> 	at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:102) ~[main/:?]
  1> 	at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:48) ~[main/:?]
  1> 	at org.elasticsearch.persistent.NodePersistentTasksExecutor$1.doRun(NodePersistentTasksExecutor.java:42) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.7.0-SNAPSHOT.jar:8.7.0-SNAPSHOT]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
  1> 	at java.lang.Thread.run(Thread.java:833) ~[?:?]

Here is a stack trace for problem # 3:

org.elasticsearch.ResourceNotFoundException: the task with id geoip-downloader and allocation id 5 doesn't exist
	at org.elasticsearch.persistent.PersistentTasksClusterService$4.execute(PersistentTasksClusterService.java:263)
	at org.elasticsearch.cluster.service.MasterService$UnbatchedExecutor.execute(MasterService.java:551)
	at org.elasticsearch.cluster.service.MasterService.innerExecuteTasks(MasterService.java:1052)
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:1017)
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:278)
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:170)
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:110)
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:148)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:850)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:257)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:223)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)

Steps to Reproduce

I added this to my local GeoIpDownloaderIT to reproduce this:

    public void testRapidStartStop() throws Exception {
        assumeTrue("only test with fixture to have stable results", ENDPOINT != null);
        putPipeline();
        for (int i = 0; i < 100; i++) {
            assertAcked(
                client().admin()
                    .cluster()
                    .prepareUpdateSettings()
                    .setPersistentSettings(Settings.builder().put(GeoIpDownloaderTaskExecutor.ENABLED_SETTING.getKey(), true))
                    .get()
            );
            if (i % 10 == 0) {
                verifyUpdatedDatabase();
            }
            assertAcked(
                client().admin()
                    .cluster()
                    .prepareUpdateSettings()
                    .setPersistentSettings(Settings.builder().put(GeoIpDownloaderTaskExecutor.ENABLED_SETTING.getKey(), false))
                    .get()
            );
        }
        assertAcked(
            client().admin()
                .cluster()
                .prepareUpdateSettings()
                .setPersistentSettings(Settings.builder().put(GeoIpDownloaderTaskExecutor.ENABLED_SETTING.getKey(), true))
                .get()
        );
        verifyUpdatedDatabase();
    }

Logs (if relevant)

No response

@masseyke masseyke added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Jan 12, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Jan 12, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@masseyke
Copy link
Member Author

One of the biggest problems is that of of #76924 we are deleting the index when we stop the task. This is (I believe) so that we meet our license agreement for the geoip data and don't retain it too long.
We discussed a solution being to have the geoip downloader task running all the time, and when you disable it we just delete the data and stop doing the download. Then we are guaranteed that the code kicking off the download and kicking off the delete are on the same node in the same process, and can synchronize things so that they don't happen concurrently. But (assuming some complications can be worked out there) that still leaves all the other ingest nodes that are pulling the data from the .geoip_databases index and creating / deleting their local copies. I don't know how to get around that without a distributed lock accessible from all nodes. But I think the risks of things going wrong with that lock are higher than the risks of the downloader occasionally not behaving well (after all, this is more of a problem in tests where the downloader is rapidly enabled/disabled than in production).

@jintaoit
Copy link

jintaoit commented Feb 3, 2023

I meet a similar problem. when I start elasticsearch first time ,it runs without any error.
From the log, I see it download a geoip database called GeoLite2-ASN.mmdb successfully. But when I restart it cause such error

[2023-02-01T15:34:09,249][ERROR][o.e.i.g.GeoIpDownloader ] [node1] exception during geoip databases updateorg.elasticsearch.ElasticsearchException: not all primary shards of [.geoip_databases] index are active

my server has enough space and cpu is 95% idle.

@Sagarsinghinfra360
Copy link

I am also getting the same error when I install Elasticsearch, The error is :
ERROR:
[2023-02-01T15:34:09,249][ERROR][o.e.i.g.GeoIpDownloader ] [node1] exception during geoip databases updateorg.elasticsearch.ElasticsearchException: Not all primary shards of [.geoip_databases] index are active

I read multiple comments on the Internet, they were referring me to disable "ingest. GeoIP.downloader.enabled" but this also could not resolve the error, Is anybody also facing the same issue?

@RogerWeihrauch
Copy link

Hi folks
Same situation here: since I am a real newbie on the whole ELK Stack environment, maybe this could be a hint to you:

  • start Elast w/: ingest.geoip.downloader.enabled: false in 'elasticsearch.yml' config file
  • wait 'til completely up (btw: how can I tell that [.geoip_databases] indexing/deleteing process has finished after restart w/ above option set?)
  • set option now to true to get GeoIpDatabase successfully downloaded and loaded into Elast
  • restart Elast
  • successful download on load into Elast of GeoIpDatabase ( again: how long to wait until stop of Elast?):
    -[2024-04-05T13:34:13,987][INFO ][o.e.i.g.GeoIpDownloader ] [node01] successfully downloaded geoip database [GeoLite2-ASN.mmdb]
    [2024-04-05T13:34:14,084][INFO ][o.e.i.g.DatabaseNodeService] [node01] successfully loaded geoip database file [GeoLite2-ASN.mmdb]
    [2024-04-05T13:34:18,426][INFO ][o.e.i.g.GeoIpDownloader ] [node01] successfully downloaded geoip database [GeoLite2-City.mmdb]
    [2024-04-05T13:34:18,769][INFO ][o.e.i.g.DatabaseNodeService] [node01] successfully loaded geoip database file [GeoLite2-City.mmdb]
    [2024-04-05T13:34:19,718][INFO ][o.e.i.g.GeoIpDownloader ] [node01] successfully downloaded geoip database [GeoLite2-Country.mmdb]
    [2024-04-05T13:34:19,757][INFO ][o.e.i.g.DatabaseNodeService] [node01] successfully loaded geoip database file [GeoLite2-Country.mmdb]
  • shutdown Elast
  • comment-out above option
  • restart Elast
  • and, error appears again:
    [2024-04-05T13:26:29,234][ERROR][o.e.i.g.GeoIpDownloader ] [node01] exception during geoip databases update org.elasticsearch.ElasticsearchException: not all primary shards of [.geoip_databases] index are active

This is confusing for a newbie like me.
What to do on this? Just ignore?

Thanx and regards,
Roger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team
Projects
None yet
Development

No branches or pull requests

6 participants