sqswatcher cannot start on brand new cluster after upgrade to v2.4.0 #1142
Hi, I rolled back to the previous version in order to get some urgent batches done. I will follow up on this thread once I upgrade again and reproduce the error. Sorry for the delay.
@msherman13 I'm not able to reproduce.
@lukeseawalker just reproduced it. Yes, brand new cluster. Here is the pip freeze output: [root@ip-172-31-16-241 centos]# pip freeze
@lukeseawalker after some more digging, it seems that this issue is related to the paramiko package, which I guess you guys are using for remote commands on the cluster nodes? Uninstalling and reinstalling paramiko seems to fix the issue; however, when I launch nodes they now hit the below error in the nodewatcher log, which is also coming from paramiko: 2019-06-17 14:06:25,655 INFO [nodewatcher:main] nodewatcher startup
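For reference, the reinstall workaround described above amounts to something like the following (a sketch based on the thread; running as root with the system pip is an assumption drawn from the root prompt shown earlier):

```bash
# Remove the broken paramiko install and pull a fresh copy,
# which also pulls in a compatible cryptography release
pip uninstall -y paramiko
pip install paramiko
```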
I think something is getting broken by some Python packages installed in your post_install script. Here is the diff between a cluster without any custom Python packages and yours:
I think the problem is that cryptography is being downgraded to 1.7.2 while paramiko requires > 2.5.0, as you can verify by running a
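The exact verification command was cut off above; `pip check`, which reports installed packages whose declared dependency requirements are not satisfied, is one plausible candidate (an assumption, not necessarily what was originally suggested):

```bash
# Flags dependency conflicts, printing lines like:
#   paramiko x.y.z has requirement cryptography>=2.5, but you have cryptography 1.7.2
pip check
# Show the versions actually installed
pip freeze | grep -iE 'paramiko|cryptography'
```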
I would recommend installing your Python libraries in a virtualenv so as not to interfere with the system libraries that are currently used by ParallelCluster. If you want to better understand where those libraries are coming from, you could run the following:
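The command itself did not survive in this thread; one plausible way to trace where a library came from on the CentOS image, assuming pip and rpm are available, is (an assumption, not the original suggestion):

```bash
# Show version and install location of the suspect packages
pip show cryptography paramiko
# Ask rpm whether the on-disk files belong to a system package;
# pip-installed copies report "not owned by any package"
rpm -qf "$(python -c 'import cryptography, os; print(os.path.dirname(cryptography.__file__))')"
```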
@demartinofra yum install -y awscli
I believe this comes from your post_install script. If you don't have any confidential data in it, you could share it so I can take a look, or at least share the part where you install packages (not only with pip). By the way, awscli is already installed.
@demartinofra sure, here you go. And good to know that awscli is already installed; guess that could help :)
I think you can drop the awscli line, since awscli is already installed.
Also, I would recommend using virtualenv or pipenv to install all the Python packages your application needs, so that you don't risk breaking system libraries; see the sketch below. Please let us know if this solves the issue.
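A minimal sketch of that approach (the environment path and package list are illustrative):

```bash
# Create an isolated environment so application dependencies
# never overwrite the system packages ParallelCluster relies on
virtualenv /shared/myapp-env
source /shared/myapp-env/bin/activate
pip install awscli   # or whatever the application actually needs
deactivate
```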
Thank you for the help. Removing the awscli line alone seems to have fixed the startup issue in the *watcher processes; I will look into the virtualenv solution. Now that that is fixed, the compute nodes are erroring in nodewatcher. The nodes do appear in qhost on the master node, but jobs are not submitted to them: 2019-06-17 16:27:29,530 ERROR [utils:_run_command] Command '['/opt/sge/bin/lx-amd64/qstat', '-xml', '-g', 'dt', '-u', '', '-f', '-l', 'hostname=ip-172-31-23-50']' returned non-zero exit status 1
What happens if you run qstat on a compute node? I'm afraid the compute node setup hasn't been successfully completed by the sqswatcher. Are there error or warning messages related to failures when handling events in the sqswatcher logs? I'm going to paste here the output of pip freeze from my cluster so you can see if there are any inconsistencies in your libraries. Also, could you verify if
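The failing scheduler query can also be reproduced by hand from the error above (a sketch; the arguments, including the empty -u value, are copied verbatim from the log, and the binary path assumes the standard ParallelCluster SGE layout):

```bash
# Re-run the query exactly as nodewatcher logged it;
# -u '' is copied from the log and may itself be part of the problem
/opt/sge/bin/lx-amd64/qstat -xml -g dt -u '' -f -l hostname=ip-172-31-23-50
echo "exit status: $?"
```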
Also getting numerous errors from sqswatcher on the master node around the same time as the compute node errors: 2019-06-17 16:37:46,656 INFO [sqswatcher:_parse_sqs_messages] Processing COMPUTE_READY event for instance i-04a2ad48b47b10743 and output:
Could you get the output of one of those execd_install logs? If you haven't done it yet, I would recommend removing the unneeded packages from post_install and starting with a new cluster so that the master node is in a clean state.
Here are the execd_install logs; looks like an issue here. I am going to remove the package installs from the bootstrap script and create the cluster from scratch to see if that fixes it.
[centos@ip-172-31-20-40 ~]$ cat /opt/sge/default/common/install_logs/execd_install_ip-172-31-23-65_2019-06-17_16:37:47.log
Your $SGE_ROOT directory: /opt/sge
Using cell: >default<
Specified cluster name >$SGE_CLUSTER_NAME=p6444< resulted in the following conflict!
Remove existing component(s) of cluster > p6444 < first!
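The conflict message suggests a stale registration of the host left over from an earlier failed attempt. On a standard SGE installation the stale component can be listed and removed with qconf before retrying (a sketch, not a step taken in this thread; the hostname is the one from the log):

```bash
# List execution hosts the qmaster already knows about
qconf -sel
# Remove the stale execution-host and submit-host entries for the node
qconf -de ip-172-31-23-65
qconf -ds ip-172-31-23-65
```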
So it seems like things are working for me now. The ERROR is still present when a compute node starts, but it seems that the next attempt on the same node is successful and jobs are submitted. Maybe some kind of race condition?
That is normal: when the node starts, it is not yet attached to the scheduler; it has to wait for the sqswatcher to process the COMPUTE_READY message. We should probably be clearer in the message we write to the logs. I'm going to resolve this issue, but feel free to reopen if needed.
Just wanted to post a follow-up in case anyone else is having this issue. My group was having similar issues with our cluster (sqswatcher/jobwatcher errors 'cannot import name util', etc.) after our recent upgrade. We use packer+ansible to install packages and modify the base ParallelCluster AMI before spinning up the cluster. I started cleaning out the cruft in our AMI-building and post-install routines and eventually resolved the issue by removing ipa-client from our apt install list (it apparently installs a version of gssapi as a dependency, which looks like it breaks paramiko, which in turn breaks the 3 ParallelCluster node services). Hopefully this is not long-term; it sounds like an independent fix for the gssapi source of the issue, at least, is coming soon to paramiko: paramiko/paramiko#1311.
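A quick way to check for the paramiko breakage described throughout this thread (a sketch; the package manager command depends on the image, and the grep targets are illustrative):

```bash
# If paramiko (or its gssapi/cryptography dependencies) is broken,
# this import fails with errors like "cannot import name util"
python -c "import paramiko; print(paramiko.__version__)"
# Check whether the suspect packages are present
pip freeze | grep -iE 'gssapi|paramiko|cryptography'
```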
Environment:
Bug description and how to reproduce:
Just need to create a new cluster with the below config. The below message is logged to /var/log/sqswatcher continuously and new hosts are never registered to qhost.
CONFIG:
[aws]
aws_region_name = us-east-1
aws_access_key_id = <MY_KEY_ID>
aws_secret_access_key = <MY_ACCESS_KEY>
[cluster default]
vpc_settings = default-vpc
key_name = parallelcluster
cluster_type = spot
base_os = centos7
master_instance_type = t2.micro
compute_instance_type = c4.8xlarge
compute_root_volume_size = 1024
max_queue_size = 10
initial_queue_size = 2
ebs_settings = sim-data
scaling_settings = sim-scaling
s3_read_resource = arn:aws:s3:::<MY_BUCKET>/*
post_install = s3://<MY_BUCKET>/bootstrap.sh
[global]
update_check = true
sanity_check = true
cluster_template = default
[aliases]
ssh = ssh -i ~/.ssh/parallelcluster.pem -oStrictHostKeyChecking=no {CFN_USER}@{MASTER_IP} {ARGS}
[vpc default-vpc]
vpc_id = <MY_VPC>
master_subnet_id = <MY_SUBNET>
[ebs sim-data]
shared_dir = sim-data
volume_size = 512
[scaling sim-scaling]
scaledown_idletime = 10
ERRORS IN sqswatcher log on master node:
2019-06-15 03:05:00,419 INFO [sqswatcher:main] sqswatcher startup
2019-06-15 03:05:00,419 INFO [sqswatcher:_get_config] Reading /etc/sqswatcher.cfg
2019-06-15 03:05:00,420 INFO [sqswatcher:_get_config] Configured parameters: region=us-east-1 scheduler=sge sqsqueue=parallelcluster-sim-cluster-SQS-1COTZD6PKTKMT table_name=parallelcluster-sim-cluster-DynamoDBTable-8SHVVL471GI0 cluster_user=centos proxy=NONE stack_name=parallelcluster-sim-cluster
2019-06-15 03:05:00,567 INFO [utils:get_asg_name] ASG parallelcluster-sim-cluster-ComputeFleet-12VBLOW7Q5Q39 found for the stack parallelcluster-sim-cluster
2019-06-15 03:05:00,569 CRITICAL [sqswatcher:main] An unexpected error occurred: cannot import name util
2019-06-15 03:05:30,593 INFO [sqswatcher:main] sqswatcher startup
2019-06-15 03:05:30,593 INFO [sqswatcher:_get_config] Reading /etc/sqswatcher.cfg
2019-06-15 03:05:30,594 INFO [sqswatcher:_get_config] Configured parameters: region=us-east-1 scheduler=sge sqsqueue=parallelcluster-sim-cluster-SQS-1COTZD6PKTKMT table_name=parallelcluster-sim-cluster-DynamoDBTable-8SHVVL471GI0 cluster_user=centos proxy=NONE stack_name=parallelcluster-sim-cluster
2019-06-15 03:05:30,843 INFO [utils:get_asg_name] ASG parallelcluster-sim-cluster-ComputeFleet-12VBLOW7Q5Q39 found for the stack parallelcluster-sim-cluster
2019-06-15 03:05:30,844 CRITICAL [sqswatcher:main] An unexpected error occurred: cannot import name util