Upgrade to Cortex 1.15.x fails #482

Closed
davidg-datascene opened this issue Sep 27, 2023 · 8 comments

@davidg-datascene

When is the helm chart going to support 1.15.x of Cortex? Thanks.

A quick test upgrading the Cortex image from v1.14.1 to v1.15.3 causes several pods to crash-loop and fail to start.

helm upgrade -n cortex cortex --reset-values -f /tmp/cortex-v1.15.3 --set image.tag=v1.15.3 cortex-helm/cortex

kubectl get pods -n cortex
NAME                                     READY   STATUS             RESTARTS        AGE
cortex-alertmanager-6f4cd878f4-cvvfq     0/1     CrashLoopBackOff   9 (112s ago)    26m
cortex-compactor-0                       0/1     Running            1 (9m16s ago)   26m
cortex-distributor-76ddd84f5b-69296      1/1     Running            0               26m
cortex-distributor-76ddd84f5b-z9xcs      1/1     Running            0               26m
cortex-ingester-5c4dcd7d85-nsgqq         0/1     Running            1 (10m ago)     26m
cortex-ingester-66488dd769-krjmz         1/1     Running            0               2d1h
cortex-ingester-66488dd769-twt2g         1/1     Running            0               2d1h
cortex-nginx-5cf69b45b7-qq9ph            1/1     Running            0               16d
cortex-nginx-5cf69b45b7-tqkzb            1/1     Running            0               16d
cortex-querier-57f85ccccb-v4lsv          1/1     Running            0               2d1h
cortex-querier-58c9d857dd-bhjn4          0/1     Running            4 (54s ago)     7m35s
cortex-query-frontend-6d8fffb794-dhkgs   1/1     Running            0               25m
cortex-query-frontend-6d8fffb794-ksq84   1/1     Running            0               26m
cortex-ruler-565ccb7c44-ckjsb            0/1     Running            4 (18s ago)     7m
cortex-store-gateway-0                   0/1     Running            0               3m48s
@nschad
Collaborator

nschad commented Sep 27, 2023

Can you share the error?

@davidg-datascene
Author

cortex-distributor-5478b6758f-cl457      1/1     Running   0             10m
cortex-distributor-5478b6758f-lgkjc      1/1     Running   0             10m
cortex-ingester-66488dd769-krjmz         1/1     Running   0             2d23h
cortex-ingester-66488dd769-twt2g         1/1     Running   0             2d23h
cortex-nginx-5cf69b45b7-qq9ph            1/1     Running   0             16d
cortex-nginx-5cf69b45b7-tqkzb            1/1     Running   0             16d
cortex-querier-57f85ccccb-v4lsv          1/1     Running   0             2d23h
cortex-query-frontend-6cc65c85c5-bwczv   1/1     Running   0             10m
cortex-query-frontend-6cc65c85c5-rg2xm   1/1     Running   0             10m

Failing pods after upgrade from 1.14.1 to 1.15.3:
cortex-alertmanager-8647857b79-v9qdm     0/1     Running   3 (87s ago)   6m29s
cortex-compactor-0                       0/1     Running   0             6m21s
cortex-ingester-64d6b76cc8-bp5d5         0/1     Running   0             6m29s
cortex-querier-74f7f589-2f9js            0/1     Running   3 (87s ago)   6m28s
cortex-ruler-7746ccfd6f-xnjr7            0/1     Running   3 (85s ago)   6m27s
cortex-store-gateway-0                   0/1     Running   0             6m23s

k logs -n cortex cortex-querier-74f7f589-2f9js
level=info ts=2023-09-28T10:30:59.065683442Z caller=main.go:194 msg="Starting Cortex" version="(version=1.15.3, branch=HEAD, revision=21e8366)"
level=info ts=2023-09-28T10:30:59.066064447Z caller=server.go:323 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2023-09-28T10:30:59.067243862Z caller=memberlist_client.go:399 msg="Using memberlist cluster node name" name=cortex-querier-74f7f589-2f9js-36456d1f
level=info ts=2023-09-28T10:30:59.077885501Z caller=memberlist_client.go:575 msg="joined memberlist cluster" reached_nodes=4
level=info ts=2023-09-28T10:30:59.085044494Z caller=memberlist_client.go:536 msg="joined memberlist cluster" reached_nodes=4
ts=2023-09-28T10:31:00.178013763Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'cortex-querier-74f7f589-2f9js-808f43ae' from=100.77.7.114:7946"
ts=2023-09-28T10:31:02.179318929Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'cortex-querier-74f7f589-2f9js-808f43ae' from=100.77.7.50:7946"
ts=2023-09-28T10:31:02.179673135Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'cortex-querier-74f7f589-2f9js-808f43ae' from=100.77.7.42:7946"
ts=2023-09-28T10:31:02.179716035Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'cortex-querier-74f7f589-2f9js-808f43ae' from=100.77.7.109:7946"
ts=2023-09-28T10:31:02.179881038Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node cortex-querier-74f7f589-2f9js-808f43ae from=100.77.7.114:42498"
ts=2023-09-28T10:31:04.069733066Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'cortex-querier-74f7f589-2f9js-808f43ae' from=100.77.7.109:7946"

k logs cortex-ruler-7746ccfd6f-xnjr7 -n cortex
level=info ts=2023-09-28T10:31:00.404790924Z caller=main.go:194 msg="Starting Cortex" version="(version=1.15.3, branch=HEAD, revision=21e8366)"
level=info ts=2023-09-28T10:31:00.405091428Z caller=server.go:323 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"

k logs -n cortex cortex-store-gateway-0
level=info ts=2023-09-28T10:22:49.386296971Z caller=main.go:194 msg="Starting Cortex" version="(version=1.15.3, branch=HEAD, revision=21e8366)"
level=info ts=2023-09-28T10:22:49.386591376Z caller=server.go:323 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"


k logs cortex-alertmanager-8647857b79-v9qdm -n cortex
level=info ts=2023-09-28T10:27:38.465985397Z caller=main.go:194 msg="Starting Cortex" version="(version=1.15.3, branch=HEAD, revision=21e8366)"
level=info ts=2023-09-28T10:27:38.466976213Z caller=server.go:323 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"

k logs -n cortex cortex-compactor-0
level=info ts=2023-09-28T10:22:48.991038111Z caller=main.go:194 msg="Starting Cortex" version="(version=1.15.3, branch=HEAD, revision=21e8366)"
level=info ts=2023-09-28T10:22:48.991531119Z caller=server.go:323 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2023-09-28T10:22:48.992992641Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2023-09-28T10:22:48.993143843Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2023-09-28T10:22:48.993146343Z caller=module_service.go:64 msg=initialising module=runtime-config
level=info ts=2023-09-28T10:22:48.993390847Z caller=module_service.go:64 msg=initialising module=compactor

k logs -n cortex cortex-ingester-64d6b76cc8-bp5d5
level=info ts=2023-09-28T10:22:37.989674144Z caller=main.go:194 msg="Starting Cortex" version="(version=1.15.3, branch=HEAD, revision=21e8366)"
level=info ts=2023-09-28T10:22:37.989985548Z caller=server.go:323 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"

@nschad
Collaborator

nschad commented Sep 28, 2023

There is not a single error in your logs.
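One way to check this claim on a larger log dump is to filter the logfmt output for error-level lines. The helper below is a minimal sketch; `errors_only` is a hypothetical name, not part of Cortex or kubectl. Note that the memberlist warnings in the querier log are `level=warn`, so they are deliberately not matched.

```shell
# Hypothetical helper: keep only error-level lines from logfmt output on stdin.
# Prints nothing (and still exits 0) when no error lines are present.
errors_only() {
  grep -E 'level=error' || true
}

# Usage sketch (needs cluster access, so it is left commented out).
# --previous shows the logs of the last terminated container, which is often
# where the failure of a restarting pod is recorded:
#   kubectl logs -n cortex cortex-querier-74f7f589-2f9js --previous | errors_only
```

If this prints nothing, the cause is more likely to show up in the pod events than in the container logs.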

@davidg-datascene
Author

davidg-datascene commented Sep 28, 2023

Yep. The startup probes against /ready are failing (see the events below), so the pods never reach a ready state and are eventually killed.

1s          Warning   Unhealthy                pod/cortex-store-gateway-0                    Startup probe failed: Get "http://100.77.7.153:8080/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
25s         Warning   Unhealthy                pod/cortex-compactor-0                        Startup probe failed: HTTP probe failed with statuscode: 503

NAME                             ENDPOINTS                                                          AGE
cortex-alertmanager                                                                                 237d
cortex-compactor                                                                                    237d
cortex-distributor               100.77.7.148:8080,100.77.7.154:8080                                237d
cortex-distributor-headless      100.77.7.148:9095,100.77.7.154:9095                                237d
cortex-ingester                  100.77.7.42:8080,100.77.7.50:8080                                  237d
cortex-ingester-headless         100.77.7.42:9095,100.77.7.50:9095                                  237d
cortex-memberlist                100.77.7.148:7946,100.77.7.154:7946,100.77.7.42:7946 + 1 more...   237d
cortex-nginx                     100.77.7.16:80,100.77.7.23:80                                      237d
cortex-querier                   100.77.7.47:8080                                                   237d
cortex-query-frontend            100.77.7.151:8080,100.77.7.155:8080                                237d
cortex-query-frontend-headless   100.77.7.151:9095,100.77.7.155:9095                                237d
cortex-ruler                                                                                        237d
cortex-store-gateway                                                                                237d
cortex-store-gateway-headless                                                                       237d
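The two events above show two different probe outcomes: a 503 from /ready and a timeout with no response at all. A rough way to tell them apart by hand is to query /ready yourself and map the status code, as sketched below. `ready_verdict` is a hypothetical helper, and the port-forward target is only an example taken from the pod names in this thread.

```shell
# Hypothetical helper: translate the HTTP status seen on /ready into a verdict.
# curl reports 000 when it gets no HTTP response (timeout, connection refused).
ready_verdict() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "not ready: process is up but /ready reports unavailable" ;;
    000) echo "no response: timeout or connection refused" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Usage sketch (needs cluster access, so it is left commented out):
#   kubectl port-forward -n cortex pod/cortex-compactor-0 8080:8080 &
#   ready_verdict "$(curl -s -o /dev/null -m 5 -w '%{http_code}' http://localhost:8080/ready)"
```

Dropping `-o /dev/null` from the curl call prints the response body, which may say more about why the pod is not ready.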

@nschad
Collaborator

nschad commented Sep 28, 2023

Can't reproduce. For me everything works fine. Try running

helm dep update .

before upgrading/installing to get the latest memcached.

I tested with:

helm install cortex . -n cortex --create-namespace -f ci/test-sts-values.yaml --set image.tag=v1.15.3

@davidg-datascene
Author

davidg-datascene commented Sep 28, 2023

Have you tried an upgrade from 1.14.1 to 1.15.3? Looking at the CHANGELOG (https://github.com/cortexproject/cortex/blob/master/CHANGELOG.md?plain=1), it doesn't appear that you need to do anything specific to upgrade on the application side.

If you can confirm the upgrade works via the Helm chart, then I'll close this issue and raise one against the Cortex application. Thanks.

@nschad
Collaborator

nschad commented Sep 28, 2023

@nschad nschad closed this as completed Sep 28, 2023
@davidg-datascene
Author

Found the reason my pods were not starting: cortexproject/cortex#5449. I'm using Azure and needed:

 endpoint_suffix: blob.core.windows.net

Thanks for confirming the chart was okay.
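For anyone hitting the same problem, here is a rough sketch of how that setting might be supplied through the chart. It assumes the chart passes the Cortex configuration through a top-level `config:` values key and that blocks storage is the Azure-backed component; the overlay path and the `OVERLAY` variable are made up for illustration. If the ruler or alertmanager storage also uses Azure, their `azure` blocks presumably need the same key.

```shell
# Hypothetical overlay file; adjust the path to taste.
OVERLAY="${OVERLAY:-/tmp/cortex-azure-overlay.yaml}"

# Assumption: the chart exposes the Cortex config under `config:`.
cat > "$OVERLAY" <<'EOF'
config:
  blocks_storage:
    azure:
      endpoint_suffix: blob.core.windows.net
EOF

# Usage sketch (needs cluster access, so it is left commented out). The overlay
# is passed after the existing values file so that it takes precedence:
#   helm upgrade -n cortex cortex --reset-values \
#     -f /tmp/cortex-v1.15.3 -f "$OVERLAY" \
#     --set image.tag=v1.15.3 cortex-helm/cortex
```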
