Upgrade assistant should warn of incompatible system indices settings when migrating from 7 to 8 (the index will become red) #88324

Open
lucabelluccini opened this issue Jul 6, 2022 · 7 comments
Labels
>bug :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team

Comments

@lucabelluccini
Contributor

lucabelluccini commented Jul 6, 2022

Elasticsearch Version

7.x, 8.x

Installed Plugins

No response

Java Version

bundled

OS Version

N/A

Problem Description

In 8.x, system indices no longer accept most setting overrides - only a few settings are allowed at the time of writing:

index.blocks.read_only
index.blocks.read
index.blocks.write
index.blocks.metadata
index.blocks.read_only_allow_delete

When migrating from 7.x to 8.x, a system index (e.g. .security-7) may carry a setting that was allowed in 7.x but is rejected in 8.x, and its shards then fail to be allocated during the upgrade.
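
The settings currently applied to a system index can be reviewed before upgrading, and any setting outside the list above reset while still on 7.x. A minimal sketch, assuming a user permitted to access the restricted index (the slowlog setting shown is the one used in the reproduction below):

GET .security-7/_settings?flat_settings=true

PUT .security-7/_settings
{
  "index.search.slowlog.level": null
}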

Steps to Reproduce

  1. Create a 7.17.5 cluster.

  2. Perform:

PUT .security-7/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.trace": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "800ms",
  "index.search.slowlog.threshold.fetch.debug": "500ms",
  "index.search.slowlog.threshold.fetch.trace": "200ms",
  "index.search.slowlog.level": "info"
}
  3. Go to the Upgrade Assistant - all good.

  4. Upgrade to 8.3.1.

  5. At some point during the upgrade, the cluster turns red.

  6. The cluster allocation explain output is:

{
  "can_allocate": "yes",
  "index": ".security-7",
  "target_node": {
    "attributes": {
      "server_name": "instance-0000000001.018d26f51e90476dac2a56befc09ccfc",
      "availability_zone": "us-central1-b",
      "region": "unknown-region",
      "instance_configuration": "gcp.es.datahot.n2.68x10x45",
      "xpack.installed": "true",
      "logical_availability_zone": "zone-1",
      "data": "hot"
    },
    "transport_address": "10.42.4.59:19836",
    "id": "iy10uNC9QRCJVc9xcRkJsg",
    "name": "instance-0000000001"
  },
  "node_allocation_decisions": [
    {
      "node_decision": "yes",
      "transport_address": "10.42.6.132:19611",
      "node_name": "instance-0000000000",
      "node_id": "LiMkJ1k9QUWqa-j972KGmw",
      "store": {
        "in_sync": true,
        "allocation_id": "aK-9wh3OQgq2RoyyRfsPaQ"
      },
      "node_attributes": {
        "server_name": "instance-0000000000.018d26f51e90476dac2a56befc09ccfc",
        "availability_zone": "us-central1-c",
        "region": "unknown-region",
        "instance_configuration": "gcp.es.datahot.n2.68x10x45",
        "xpack.installed": "true",
        "logical_availability_zone": "zone-0",
        "data": "hot"
      }
    },
    {
      "node_decision": "yes",
      "transport_address": "10.42.4.59:19836",
      "node_name": "instance-0000000001",
      "node_id": "iy10uNC9QRCJVc9xcRkJsg",
      "store": {
        "in_sync": true,
        "allocation_id": "BcMGDowKQlKcm9TbEEiaHw"
      },
      "node_attributes": {
        "server_name": "instance-0000000001.018d26f51e90476dac2a56befc09ccfc",
        "availability_zone": "us-central1-b",
        "region": "unknown-region",
        "instance_configuration": "gcp.es.datahot.n2.68x10x45",
        "xpack.installed": "true",
        "logical_availability_zone": "zone-1",
        "data": "hot"
      }
    }
  ],
  "allocation_id": "BcMGDowKQlKcm9TbEEiaHw",
  "current_state": "unassigned",
  "shard": 0,
  "primary": true,
  "note": "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "allocate_explanation": "Elasticsearch can allocate the shard.",
  "unassigned_info": {
    "last_allocation_status": "no",
    "reason": "ALLOCATION_FAILED",
    "failed_allocation_attempts": 5,
    "at": "2022-07-06T17:27:45.267Z",
    "details": "failed shard on node [iy10uNC9QRCJVc9xcRkJsg]: failed to create index, failure java.lang.IllegalArgumentException: unknown setting [index.search.slowlog.level] please check that any required plugins are installed, or check the breaking changes documentation for removed settings\n\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:563)\n\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:509)\n\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:479)\n\tat org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:688)\n\tat org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:607)\n\tat org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:177)\n\tat org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:505)\n\tat org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:232)\n\tat org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:545)\n\tat org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:531)\n\tat org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:504)\n\tat org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)\n\tat org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:155)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:710)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:260)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:223)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.lang.Thread.run(Thread.java:833)\n"
  }
}
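
For reference, the output above comes from the cluster allocation explain API; the unassigned primary can also be targeted explicitly:

GET _cluster/allocation/explain
{
  "index": ".security-7",
  "shard": 0,
  "primary": true
}
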
  7. Recovering from this situation requires a role with "allow_restricted_indices": true and a user from the file realm (the native realm is unavailable because the .security index is red); a sketch of such a role and user follows.
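
A minimal sketch of that recovery setup, assuming a file-based role definition (and that allow_restricted_indices is accepted in roles.yml) plus the elasticsearch-users tool; role and user names are illustrative, and the role API itself cannot be used while the .security index is red:

# roles.yml on each node
restricted_indices_admin:
  cluster: [ "all" ]
  indices:
    - names: [ ".security*" ]
      privileges: [ "all" ]
      allow_restricted_indices: true

# create a file realm user with that role
bin/elasticsearch-users useradd recovery_admin -p <password> -r restricted_indices_admin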

But the API request below, executed with a role having "allow_restricted_indices": true:

PUT .security-7/_settings
{
    "index": {
        "search": {
            "slowlog": null
        }
    }
}

is still rejected with:

{
  "status": 403,
  "error": {
    "root_cause": [
      {
        "reason": "action [indices:admin/settings/update] is unauthorized for user [elastic...] with roles [found-internal-admin,superuser] on restricted indices [.security-7], this action is granted by the index privileges [manage,all]",
        "type": "security_exception"
      }
    ],
    "type": "security_exception",
    "reason": "action [indices:admin/settings/update] is unauthorized for user [elastic...] with roles [found-internal-admin,superuser] on restricted indices [.security-7], this action is granted by the index privileges [manage,all]"
  }
}
  8. Trying with:
POST _snapshot/found-snapshots/cloud-snapshot-2022.07.06-sormf1rgr7iq3v8hmysqzg/_restore
{
  "include_global_state": false,
  "feature_states": ["none"],
  "indices": ".security-7",
    "index_settings": {
    "index.search.slowlog": null
  }
}

We get:

{
  "status": 400,
  "error": {
    "root_cause": [
      {
        "reason": "requested system indices [.security-7], but system indices can only be restored as part of a feature state",
        "type": "illegal_argument_exception"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "requested system indices [.security-7], but system indices can only be restored as part of a feature state"
  }
}
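
For context, the available feature states (and the system indices they cover) can be listed with the get features API; the security system indices belong to the security feature state used in the next step:

GET _features
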
  9. Trying with:
POST _snapshot/found-snapshots/cloud-snapshot-2022.07.06-sormf1rgr7iq3v8hmysqzg/_restore
{
  "indices": "-*",
  "include_global_state": false,
  "feature_states": ["security"],
  "index_settings": {
    "index.search.slowlog": null
  }
}

The request is acknowledged.

  10. The index becomes green:
green open .security-7                     8Yt___ZlRg6_mZBAlmQBdQ 1 1 122  79   1.1mb 458.4kb
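
For reference, the line above is cat indices output and can be reproduced with (exact values will differ per cluster):

GET _cat/indices/.security-7?v
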
  11. The settings are still there!
{
  ".security-7": {
    "settings": {
      "index": {
        "provided_name": ".security-7",
        "number_of_replicas": "1",
        "search": {
          "slowlog": {
            "level": "debug",
            "threshold": {
              "query": {
                "warn": "10s",
                "debug": "2s",
                "info": "5s",
                "trace": "500ms"
              },
              "fetch": {
                "warn": "1s",
                "debug": "500ms",
                "info": "800ms",
                "trace": "200ms"
              }
            }
          }
        },
        ...
      "archived": {
        "index": {
          "search": {
            "slowlog": {
              "level": "info"
            }
          }
        }
      }

  12. Executing:
PUT .security-7/_settings?flat_settings=true
{
    "index.search.slowlog.*": null
}

We get:

{
  "status": 400,
  "error": {
    "suppressed": [
      {
        "reason": "unknown setting [archived.index.search.slowlog] please check that any required plugins are installed, or check the breaking changes documentation for removed settings",
        "type": "illegal_argument_exception"
      }
    ],
    "root_cause": [
      {
        "reason": "unknown setting [archived.index.search.slowlog.level] please check that any required plugins are installed, or check the breaking changes documentation for removed settings",
        "type": "illegal_argument_exception"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "unknown setting [archived.index.search.slowlog.level] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
  }
}
  13. With:
PUT .security*/_settings
{
    "index.slowlog.*": null
}

The request is acknowledged, but the index still has the broken settings.

Logs (if relevant)

No response

@lucabelluccini lucabelluccini added >bug :Core/Infra/Core Core issues without another label needs:triage Requires assignment of a team area label labels Jul 6, 2022
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jul 6, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@lucabelluccini lucabelluccini changed the title Upgrade assistant should warn of incompatible system indices settings when migrating from 7 to 8 Upgrade assistant should warn of incompatible system indices settings when migrating from 7 to 8 (the index will become red) Jul 6, 2022
@grcevski grcevski added team-discuss and removed needs:triage Requires assignment of a team area label labels Jul 6, 2022
@grcevski grcevski self-assigned this Jul 20, 2022
@grcevski
Contributor

grcevski commented Jul 20, 2022

We had a discussion on this at the core/infra meeting and there are a few follow-up bugs/issues we need to resolve here:

  • The upgrade assistant should've caught this and it didn't. [Confirmed: system indices are ignored by the upgrade assistant because users cannot affect them]
  • Archived settings are useless on system indices and we should simply remove them on startup instead of archiving them. [Done via #88903 (Delete invalid settings for system indices)]
  • Archiving filtered settings (which predate secure settings) can cause issues such as customer passwords being exposed, and we need to fix this somehow.
  • Preventing index and cluster settings updates in the presence of archived settings is not the best way to warn users that they need to fix something in their setup. Instead of blocking updates to these settings, we should leverage the new health API to bring the problems front and centre.

@grcevski
Contributor

I did some debugging on why the upgrade assistant didn't warn us about these deprecated options on a system index. It turns out that Elasticsearch correctly reports the critical deprecation; however, the following code in the upgrade assistant ignores deprecations on system indices:

https://github.com/elastic/kibana/blob/1bfeab7553899efcfa9a6e46b37dc3c7681dcf3b/x-pack/plugins/upgrade_assistant/server/lib/es_deprecations_status.ts#L33

We correctly reported it in Elasticsearch:

"index_settings": {
    ".security-7": [
      {
        "level": "critical",
        "message": "Setting [index.indexing.slowlog.level] is deprecated",
        "url": "https://ela.st/es-deprecation-7-slowlog-settings",
        "details": "Remove the [index.indexing.slowlog.level] setting. Use the [index.*.slowlog.threshold] settings to set the log levels.",
        "resolve_during_rolling_upgrade": false,
        "_meta": {
          "actions": [
            {
              "action_type": "remove_settings",
              "objects": [
                "index.indexing.slowlog.level"
              ]
            }
          ]
        }
      }
    ]
  }
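
For reference, the entry above comes from the Elasticsearch deprecation info API, which can be queried directly (it is the same data the upgrade assistant consumes):

GET /_migration/deprecations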

@lucabelluccini
Contributor Author

That's great @grcevski - I'm sorry I didn't try to call the API on the Elasticsearch side when reproducing.

My biggest concern here is that we allow the node to start and end up with a red index, with no way to fix the settings once the index is migrated. Is there something I didn't try in the reproduction that would allow a user to remove the problematic settings if they've already upgraded?

@grcevski
Contributor

Oh no problem @lucabelluccini, I was just mentioning what I had found. We'll need to fix this one way or another. It seems the upgrade assistant code expects all problems related to system indices to be reported in the system index migration section, while these particular deprecations are generic to all indices and are reported with the normal deprecations. I'll bring this up for a team discussion to see what the best way to fix it is.

@BBQigniter

BBQigniter commented Sep 20, 2022

found this issue too late :(

Edit:

Some more details - I stumbled into this issue today on our staging cluster, which runs on Kubernetes/ECK. The upgrade worked pretty well until only a non-data master node-pod and the last hot node-pod on version 7.17.6 were left; the indices with the index.indexing.slowlog.level setting had been moved onto that node in the meantime. From there on, the upgrade procedure stalled and I had 10 hidden indices in yellow state.

After thorough consultation with the marvelous Elastic support ( <3 ) I was told to do a "full cluster restart" - I stopped event ingestion by scaling the Logstash pods to 0, set cluster.routing.allocation.enable to primaries only, and then deleted ALL Elasticsearch pods via kubectl at once. The pods are simply recreated.

Magically, the Elasticsearch cluster fixed itself (as always) within a few minutes, and I then removed the cluster.routing.allocation.enable setting again.
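
A minimal sketch of the allocation toggling described above (whether persistent or transient is a choice; the Logstash scaling and kubectl pod deletion are environment specific and omitted):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# once the pods have been recreated and the shards have recovered, restore the default
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}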

@grcevski grcevski removed their assignment Feb 16, 2023
@igorwwwwwwwwwwwwwwwwwwww

We ran into this at @GitLab as well during the ES 7 => 8 upgrade.
