
[BUG] Wrong node IP written when joining a node; status stuck in Upgrading #648

Closed
xuzheng0017 opened this issue Dec 5, 2023 · 14 comments
Assignee: JacieChao
Labels: bug, to test
Milestone: v0.9.2

@xuzheng0017

Describe the bug
A wrong node IP was entered when joining a node, and the cluster status has been stuck in Upgrading ever since.


Screenshots
(screenshot attached)

Environments (please complete the following information):

  • OS: CentOS 7.9
  • AutoK3s Version: 0.9.1

Additional context

time="2023-12-05T11:59:34+08:00" level=info msg="the 4/5 time tring to ssh to 74.48.115.18:22 with user root"
https://mirrors.sonic.net/epel/7/x86_64/repodata/d526a7fd5dbf31d263829b2d144a41ca6126a8ead6d8a75fe0da87b1f250efb1-primary.sqlite.bz2: [Errno 14] HTTPS Error 404 - Not Found
Trying other mirror.
To address this issue please refer to the below wiki article
https://wiki.centos.org/yum-errors
If above article doesn't help to resolve this issue please use https://bugs.centos.org/.
http://mirror.tornadovps.com/pub/epel/7/x86_64/repodata/d526a7fd5dbf31d263829b2d144a41ca6126a8ead6d8a75fe0da87b1f250efb1-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found
Trying other mirror.
time="2023-12-05T12:00:04+08:00" level=info msg="the 5/5 time tring to ssh to 74.48.115.18:22 with user root"
Package yum-utils-1.1.31-54.el7_8.noarch already installed and latest version
Nothing to do
Loaded plugins: fastestmirror
Command line error: no such option: --refresh
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirror.web-ster.com
* epel: lolhost.mm.fcix.net
* extras: mirrors.oit.uci.edu
* updates: mirror.sfo12.us.leaseweb.net
Package yum-utils-1.1.31-54.el7_8.noarch already installed and latest version
Nothing to do
Loaded plugins: fastestmirror
Package yum-utils-1.1.31-54.el7_8.noarch already installed and latest version
Nothing to do
Command line error: no such option: --refresh
Loaded plugins: fastestmirror
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: ix-denver.mm.fcix.net
* epel: mirrors.ocf.berkeley.edu
* extras: mirrors.oit.uci.edu
* updates: mirror.sfo12.us.leaseweb.net
Command line error: no such option: --refresh
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirror.web-ster.com
* epel: mirrors.ocf.berkeley.edu
* extras: mirrors.oit.uci.edu
* updates: ix-denver.mm.fcix.net
Package yum-utils-1.1.31-54.el7_8.noarch already installed and latest version
Nothing to do
Loaded plugins: fastestmirror
xuzheng0017 added the bug label Dec 5, 2023
xuzheng0017 changed the title from "[BUG]" to "[BUG] Wrong node IP written when joining a node; status stuck in Upgrading" Dec 5, 2023
@JacieChao
Collaborator

Thanks for your feedback.

Are all cluster nodes running CentOS 7.9, or only the newly added worker node?
It seems the new worker node could not fetch packages from the RPM mirror.
Could you please provide the join-node parameters and the full log from joining the new node?

@xuzheng0017
Author

vps-regtech.log
This is the full log for this cluster.
All cluster nodes run CentOS 7.9. When I added a new batch of nodes, one IP address was entered incorrectly. The cluster is currently stuck in the Upgrading state.

@JacieChao
Collaborator

Is the node-join action stuck at the last line of the log you provided?
It looks like AutoK3s can't reach node 74.48.115.18 through the SSH tunnel. Is this node IP the incorrect one?
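A quick way to confirm reachability from the AutoK3s host (a sketch; the IP is taken from the log above, and the user and key must match whatever was configured for the join):

# Check that the SSH port is reachable at all
nc -zv -w 5 74.48.115.18 22
# Then try an actual SSH login with the same user AutoK3s uses
ssh -o ConnectTimeout=5 root@74.48.115.18 'echo ok'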

@xuzheng0017
Author

Q1: Yes.
Q2: 74.48.115.18 is the wrong one; I mistyped it when entering the IP.

cnrancher deleted a comment from JacieChao Dec 5, 2023
@JacieChao
Collaborator

Sometimes the native provider can't catch the join error correctly. When that happens, the cluster's status stays Upgrading forever.
I'll look into whether there's a workaround.

@xuzheng0017
Author

Okay, I'll rebuild the cluster. Thank you for your answer.
Best wishes to you.

@JacieChao
Collaborator

@xuzheng0017 There's no need to rebuild the cluster. The K3s cluster itself won't be impacted by the AutoK3s cluster status.

@xuzheng0017
Author

Okay, but I want to join more nodes, and the cluster page doesn't offer any option to do so while it's stuck in this state.

JacieChao self-assigned this Dec 6, 2023
JacieChao added this to the v0.9.2 milestone Dec 6, 2023
@JacieChao
Collaborator

The workaround below may help you:

  • Use kubectl get nodes to check whether the batch of nodes has joined the cluster successfully.
  • If not, use the autok3s join CLI to join a node, which refreshes the cluster status:
autok3s join -p native --name jacie-test --ip <master-ip> --ssh-user <your-ssh-user> --ssh-key-path <your-ssh-key-path> --worker-ips <one-worker-ip>

Once the join process is complete, the cluster status will be refreshed to Running and the UI will work properly.
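To confirm the status was refreshed, a check like this should work (a sketch; jacie-test matches the example cluster name above, and kubectl assumes your kubeconfig points at this cluster):

# List AutoK3s-managed clusters and check the status column
autok3s list
# Cross-check the nodes actually registered in K3s
kubectl get nodes -o wide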

@JacieChao
Collaborator

The bug is related to the error being caught incorrectly in a defer function. We will fix this in the next version.

@xuzheng0017
Author

I have encountered another problem:
When I deleted a node in kube-explorer and then returned to the cluster page, the number of nodes did not decrease.
I added the node again with the command:

81d5d17a77de:/home/shell # autok3s join -p native --name vps-cargogo --ip xx.xx.xx.xx --ssh-user root --ssh-key-path /root/.autok3s/vps-cargogo/id_rsa --worker-ips xx.xx.xx.xx
time="2023-12-06T14:53:03+08:00" level=info msg="[native] begin to join nodes for vps-cargogo..."
time="2023-12-06T14:53:03+08:00" level=info msg="[native] executing join k3s node logic"
time="2023-12-06T14:53:03+08:00" level=info msg="[native] successfully executed join k3s node logic"
time="2023-12-06T14:53:03+08:00" level=info msg="[native] successfully executed join logic"

@xuzheng0017
Author

Is rejoining the node via commands on the node the only option?

@JacieChao
Collaborator

JacieChao commented Dec 6, 2023

Yes. AutoK3s can't track that operation: the node was removed manually, so the removal was never synchronized to the AutoK3s database. You can't rejoin the node through AutoK3s because, from AutoK3s's side, the node is still in the cluster.
For now, the workaround is to add the node back manually with the K3s CLI, as sketched below.
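For reference, a manual worker join with the standard K3s installer might look like this (a sketch; <master-ip> and <node-token> are placeholders, and the token path is the K3s server default):

# On the server node: read the join token K3s generated at install time
sudo cat /var/lib/rancher/k3s/server/node-token
# On the worker being re-added: install the K3s agent pointing at the server
curl -sfL https://get.k3s.io | K3S_URL=https://<master-ip>:6443 K3S_TOKEN=<node-token> sh -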

@JacieChao
Copy link
Collaborator

Tested with v0.9.2-rc1. AutoK3s now returns the correct cluster status if joining nodes fails.
Closing as complete.
