
sqswatcher cannot start on brand new cluster after upgrade to v2.4.0 #1142

Closed
msherman13 opened this issue Jun 15, 2019 · 18 comments

@msherman13

Environment:

  • AWS ParallelCluster / CfnCluster version: pcluster 2.4.0
  • OS: centos7
  • Scheduler: SGE
  • Master instance type: t2.micro
  • Compute instance type: c4.8xlarge

Bug description and how to reproduce:
To reproduce, just create a new cluster with the config below. The message below is logged to /var/log/sqswatcher continuously, and new hosts are never registered in qhost.

CONFIG:
[aws]
aws_region_name = us-east-1
aws_access_key_id = <MY_KEY_ID>
aws_secret_access_key = <MY_ACCESS_KEY>

[cluster default]
vpc_settings = default-vpc
key_name = parallelcluster
cluster_type = spot
base_os = centos7
master_instance_type = t2.micro
compute_instance_type = c4.8xlarge
compute_root_volume_size = 1024
max_queue_size = 10
initial_queue_size = 2
ebs_settings = sim-data
scaling_settings = sim-scaling
s3_read_resource = arn:aws:s3:::<MY_BUCKET>/*
post_install = s3://<MY_BUCKET>/bootstrap.sh

[global]
update_check = true
sanity_check = true
cluster_template = default

[aliases]
ssh = ssh -i ~/.ssh/parallelcluster.pem -oStrictHostKeyChecking=no {CFN_USER}@{MASTER_IP} {ARGS}

[vpc default-vpc]
vpc_id = <MY_VPC>
master_subnet_id = <MY_SUBNET>

[ebs sim-data]
shared_dir = sim-data
volume_size = 512

[scaling sim-scaling]
scaledown_idletime = 10

ERRORS IN sqswatcher log on master node:
2019-06-15 03:05:00,419 INFO [sqswatcher:main] sqswatcher startup
2019-06-15 03:05:00,419 INFO [sqswatcher:_get_config] Reading /etc/sqswatcher.cfg
2019-06-15 03:05:00,420 INFO [sqswatcher:_get_config] Configured parameters: region=us-east-1 scheduler=sge sqsqueue=parallelcluster-sim-cluster-SQS-1COTZD6PKTKMT table_name=parallelcluster-sim-cluster-DynamoDBTable-8SHVVL471GI0 cluster_user=centos proxy=NONE stack_name=parallelcluster-sim-cluster
2019-06-15 03:05:00,567 INFO [utils:get_asg_name] ASG parallelcluster-sim-cluster-ComputeFleet-12VBLOW7Q5Q39 found for the stack parallelcluster-sim-cluster
2019-06-15 03:05:00,569 CRITICAL [sqswatcher:main] An unexpected error occurred: cannot import name util
2019-06-15 03:05:30,593 INFO [sqswatcher:main] sqswatcher startup
2019-06-15 03:05:30,593 INFO [sqswatcher:_get_config] Reading /etc/sqswatcher.cfg
2019-06-15 03:05:30,594 INFO [sqswatcher:_get_config] Configured parameters: region=us-east-1 scheduler=sge sqsqueue=parallelcluster-sim-cluster-SQS-1COTZD6PKTKMT table_name=parallelcluster-sim-cluster-DynamoDBTable-8SHVVL471GI0 cluster_user=centos proxy=NONE stack_name=parallelcluster-sim-cluster
2019-06-15 03:05:30,843 INFO [utils:get_asg_name] ASG parallelcluster-sim-cluster-ComputeFleet-12VBLOW7Q5Q39 found for the stack parallelcluster-sim-cluster
2019-06-15 03:05:30,844 CRITICAL [sqswatcher:main] An unexpected error occurred: cannot import name util

@msherman13 msherman13 changed the title sqswatcher cannot start on branch new cluster after upgrade to v2.4.0 sqswatcher cannot start on brand new cluster after upgrade to v2.4.0 Jun 15, 2019
@lukeseawalker
Contributor

Hi,
can you please run 'pip freeze' on the host where the sqswatcher is running and send us the output?

@msherman13
Author

I rolled back to the previous version in order to get some urgent batches done. I will follow up on this thread once I upgrade again and reproduce the error. Sorry for the delay.

@lukeseawalker
Contributor

@msherman13 I'm not able to reproduce.
Are you sure you are creating a brand new cluster?

@msherman13
Author

@lukeseawalker just reproduced it. Yes, a brand new cluster. Here is the pip freeze output:

[root@ip-172-31-16-241 centos]# pip freeze
asn1crypto==0.24.0
aws-cfn-bootstrap==1.4
aws-parallelcluster-node==2.4.0
awscli==1.16.175
Babel==0.9.6
backports.ssl-match-hostname==3.5.0.1
bcrypt==3.1.6
boto3==1.9.165
botocore==1.12.165
cffi==1.6.0
chardet==2.2.1
cloud-init==18.2
colorama==0.3.9
configobj==4.7.2
configparser==3.7.4
cryptography==1.7.2
decorator==3.4.0
docutils==0.14
enum34==1.1.6
future==0.17.1
futures==3.2.0
idna==2.4
iniparse==0.4
ipaddress==1.0.16
IPy==0.75
javapackages==1.0.0
Jinja2==2.7.2
jmespath==0.9.4
jsonpatch==1.2
jsonpointer==1.9
kitchen==1.1.1
lockfile==0.12.2
lxml==3.2.1
lz4==2.1.10
MarkupSafe==0.11
meld3==1.0.2
numpy==1.16.4
pandas==0.24.2
paramiko==2.5.0
perf==0.1
Pillow==2.0.0
ply==3.4
policycoreutils-default-encoding==0.1
prettytable==0.7.2
pyasn1==0.4.5
pycparser==2.19
pycurl==7.19.0
pygobject==3.22.0
pygpgme==0.3
pyliblzma==0.5.3
PyNaCl==1.3.0
pyparsing==1.5.6
pyparted==3.9
pyserial==2.6
pystache==0.5.4
python-daemon==1.6.1
python-dateutil==2.5.0
python-linux-procfs==0.4.9
pytz==2019.1
pyudev==0.15
pyxattr==0.5.1
PyYAML==3.10
requests==2.6.0
retrying==1.3.3
rsa==3.4.2
s3transfer==0.2.1
schedutils==0.4
seobject==0.1
sepolicy==1.1
six==1.9.0
supervisor==3.4.0
urlgrabber==3.10
urllib3==1.25.3
virtualenv==16.6.0
yum-metadata-parser==1.1.4
You are using pip version 18.0, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

@msherman13
Author

@lukeseawalker after some more digging, it seems this issue is related to the paramiko package, which I guess you are using for remote commands on the cluster nodes. Uninstalling and reinstalling paramiko seems to fix the issue; however, when I now launch nodes they get the error below in the nodewatcher log, which is also coming from paramiko:

2019-06-17 14:06:25,655 INFO [nodewatcher:main] nodewatcher startup
2019-06-17 14:06:25,655 INFO [nodewatcher:_get_config] Reading /etc/nodewatcher.cfg
2019-06-17 14:06:25,657 INFO [nodewatcher:_get_config] Configured parameters: region=us-east-1 scheduler=sge stack_name=parallelcluster-sim-cluster scaledown_idletime=20 proxy=NONE
2019-06-17 14:06:25,659 CRITICAL [nodewatcher:main] An unexpected error occurred: cannot import name util
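
As a quick check (a sketch only), importing paramiko directly should reproduce the same failure outside the daemons, and pip check should surface any underlying dependency conflict:

python -c "import paramiko"   # expected to fail with the same "cannot import name util" error while the conflict is present
pip check                     # lists installed packages whose declared requirements are not satisfied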

@demartinofra
Contributor

I think something is being broken by Python packages installed in your post_install script.

Here is the diff between a cluster without any custom python package and yours:

[centos@ip-10-0-0-235 ~]$ git diff /tmp/pip.mine /tmp/pip.his
diff --git a/tmp/pip.mine b/tmp/pip.his
index f42eaba..01a2f8f 100644
--- a/tmp/pip.mine
+++ b/tmp/pip.his
@@ -7,18 +7,19 @@ backports.ssl-match-hostname==3.5.0.1
 bcrypt==3.1.6
 boto3==1.9.165
 botocore==1.12.165
-cffi==1.12.3
+cffi==1.6.0
 chardet==2.2.1
 cloud-init==18.2
 colorama==0.3.9
 configobj==4.7.2
 configparser==3.7.4
-cryptography==2.7
+cryptography==1.7.2
 decorator==3.4.0
 docutils==0.14
 enum34==1.1.6
 future==0.17.1
 futures==3.2.0
+idna==2.4
 iniparse==0.4
 ipaddress==1.0.16
 IPy==0.75
@@ -30,10 +31,15 @@ jsonpointer==1.9
 kitchen==1.1.1
 lockfile==0.12.2
 lxml==3.2.1
+lz4==2.1.10
 MarkupSafe==0.11
 meld3==1.0.2
+numpy==1.16.4
+pandas==0.24.2
 paramiko==2.5.0
 perf==0.1
+Pillow==2.0.0
+ply==3.4
 policycoreutils-default-encoding==0.1
 prettytable==0.7.2
 pyasn1==0.4.5
@@ -48,8 +54,9 @@ pyparted==3.9
 pyserial==2.6
 pystache==0.5.4
 python-daemon==1.6.1
-python-dateutil==2.8.0
+python-dateutil==2.5.0
 python-linux-procfs==0.4.9
+pytz==2019.1
 pyudev==0.15
 pyxattr==0.5.1
 PyYAML==3.10

I think the problem is that cryptography got downgraded to 1.7.2, while paramiko 2.5.0 requires cryptography >= 2.5, as you can verify by running pip check:

[centos@ip-10-0-0-235 ~]$ pip check
paramiko 2.5.0 has requirement cryptography>=2.5, but you have cryptography 1.7.2.

I would recommend installing your Python libraries in a virtualenv so that you don't interfere with the system libraries currently used by ParallelCluster.

If you want to better understand where those libraries are coming from, you could run the following:

sudo pip install pipdeptree
pipdeptree
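
Putting the two together, a minimal sketch of the fix (the cryptography pin comes from the pip check output above; the /opt/simenv path is only an illustration):

# restore the cryptography version that paramiko 2.5.0 requires
sudo pip install --upgrade 'cryptography>=2.5'

# keep user packages out of the system site-packages from now on
# (virtualenv 16.6.0 is already on the AMI, per the pip freeze above)
virtualenv /opt/simenv
source /opt/simenv/bin/activate
pip install pandas numpy lz4   # these now live in /opt/simenv only
deactivate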

@msherman13
Author

@demartinofra
So I looked into cfn-init.log, and I do see that something is installing python-cryptography-1.7.2; ironically, it is coming from the command below. I think yum cannot see that the package is already installed via pip, so it installs its own version, which is likely causing some weird conflicts. How do you think I can fix this?

yum install -y awscli
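
For what it's worth, two commands that should confirm what that yum transaction pulled in (the package name is taken from the cfn-init.log mentioned above):

rpm -q python-cryptography           # shows the RPM-packaged version that shadowed the pip-installed one
yum deplist awscli | grep -i crypt   # shows the dependency chain that drags it in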

@demartinofra
Contributor

demartinofra commented Jun 17, 2019

I believe this comes from your post_install script. If you don't have any confidential data in it, you could share it so I can take a look; or you could at least share the part where you install packages (not only with pip).

By the way, awscli is already installed.

@msherman13
Author

@demartinofra sure, here you go. Good to know that awscli is already installed; I guess that could help :)

#!/bin/bash

set -e

yum install -y awscli
yum install -y python-pip
yum install -y htop

pip install python-dateutil==2.5.0
pip install boto3
pip install pandas
pip install numpy
pip install lz4

mkdir -p /home/centos/.aws
mkdir -p /rpms
aws s3 cp s3://<BUCKET-NAME>/aws/config /home/centos/.aws/config
aws s3 cp s3://<BUCKET-NAME>/aws/credentials /home/centos/.aws/credentials
aws s3 cp s3://<BUCKET-NAME>/h2o-boost-1.67.0-1.el7.x86_64.rpm /rpms/h2o-boost-1.67.0-1.el7.x86_64.rpm
aws s3 cp s3://<BUCKET-NAME>/h2o-gcc-8.1.0-1.el7.x86_64.rpm /rpms/h2o-gcc-8.1.0-1.el7.x86_64.rpm
yum install -y /rpms/h2o-boost-1.67.0-1.el7.x86_64.rpm
yum install -y /rpms/h2o-gcc-8.1.0-1.el7.x86_64.rpm
yum install -y glibc

@demartinofra
Contributor

I think you can drop

yum install -y awscli # already installed
yum install -y python-pip # already installed
pip install python-dateutil==2.5.0  # drop this unless you have a reason to pin this specific version; the pin itself might break ParallelCluster
pip install boto3 # already installed

Also, I would recommend using virtualenv or pipenv to install the Python packages your application needs, so that you don't risk breaking system libraries.

Please let us know if this solves the issue.
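
For reference, a trimmed version of the script with those lines dropped might look like this (a sketch only; the optional virtualenv line and its path are illustrative):

#!/bin/bash

set -e

yum install -y htop

# optional: isolate the user packages instead of touching system site-packages
# virtualenv /opt/simenv && source /opt/simenv/bin/activate
pip install pandas
pip install numpy
pip install lz4

mkdir -p /home/centos/.aws
mkdir -p /rpms
aws s3 cp s3://<BUCKET-NAME>/aws/config /home/centos/.aws/config
aws s3 cp s3://<BUCKET-NAME>/aws/credentials /home/centos/.aws/credentials
aws s3 cp s3://<BUCKET-NAME>/h2o-boost-1.67.0-1.el7.x86_64.rpm /rpms/h2o-boost-1.67.0-1.el7.x86_64.rpm
aws s3 cp s3://<BUCKET-NAME>/h2o-gcc-8.1.0-1.el7.x86_64.rpm /rpms/h2o-gcc-8.1.0-1.el7.x86_64.rpm
yum install -y /rpms/h2o-boost-1.67.0-1.el7.x86_64.rpm
yum install -y /rpms/h2o-gcc-8.1.0-1.el7.x86_64.rpm
yum install -y glibc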

@msherman13
Author

@demartinofra

Thank you for the help. It seems that removing the awscli line alone fixed the startup issue in the *watcher processes. I will look into the virtualenv solution.

Now that I have that fixed, the compute nodes are erroring in nodewatcher. The nodes do appear in qhost on the master node, but jobs are not submitted to them:

2019-06-17 16:27:29,530 ERROR [utils:_run_command] Command '['/opt/sge/bin/lx-amd64/qstat', '-xml', '-g', 'dt', '-u', '*', '-f', '-l', 'hostname=ip-172-31-23-50']' returned non-zero exit status 1
2019-06-17 16:27:29,530 ERROR [sge:is_node_down] Failed when checking if node is down with exception Command '['/opt/sge/bin/lx-amd64/qstat', '-xml', '-g', 'dt', '-u', '*', '-f', '-l', 'hostname=ip-172-31-23-50']' returned non-zero exit status 1. Reporting node as down.
2019-06-17 16:27:29,530 WARNING [nodewatcher:_poll_wait_for_node_ready] Node reported as down

@demartinofra
Contributor

demartinofra commented Jun 17, 2019

What happens if you run qstat on a compute node? I'm afraid the compute node setup hasn't been successfully completed by the sqswatcher. Are there error or warning messages related to failures when handling events in the sqswatcher logs?

I'm going to paste the output of pip freeze from my cluster here so you can check for any inconsistencies in your libraries. Also, could you verify whether pip check reports any errors?

asn1crypto==0.24.0
aws-cfn-bootstrap==1.4
aws-parallelcluster-node==2.4.0
awscli==1.16.175
Babel==0.9.6
backports.ssl-match-hostname==3.5.0.1
bcrypt==3.1.6
boto3==1.9.165
botocore==1.12.165
cffi==1.12.3
chardet==2.2.1
cloud-init==18.2
colorama==0.3.9
configobj==4.7.2
configparser==3.7.4
cryptography==2.7
decorator==3.4.0
docutils==0.14
enum34==1.1.6
future==0.17.1
futures==3.2.0
iniparse==0.4
ipaddress==1.0.16
IPy==0.75
javapackages==1.0.0
Jinja2==2.7.2
jmespath==0.9.4
jsonpatch==1.2
jsonpointer==1.9
kitchen==1.1.1
lockfile==0.12.2
lxml==3.2.1
MarkupSafe==0.11
meld3==1.0.2
paramiko==2.5.0
perf==0.1
policycoreutils-default-encoding==0.1
prettytable==0.7.2
pyasn1==0.4.5
pycparser==2.19
pycurl==7.19.0
pygobject==3.22.0
pygpgme==0.3
pyliblzma==0.5.3
PyNaCl==1.3.0
pyparsing==1.5.6
pyparted==3.9
pyserial==2.6
pystache==0.5.4
python-daemon==1.6.1
python-dateutil==2.8.0
python-linux-procfs==0.4.9
pyudev==0.15
pyxattr==0.5.1
PyYAML==3.10
requests==2.6.0
retrying==1.3.3
rsa==3.4.2
s3transfer==0.2.1
schedutils==0.4
seobject==0.1
sepolicy==1.1
six==1.9.0
supervisor==3.4.0
urlgrabber==3.10
urllib3==1.25.3
virtualenv==16.6.0
yum-metadata-parser==1.1.4

@msherman13
Author

I'm also getting numerous errors from sqswatcher on the master node around the same time as the compute node errors:

2019-06-17 16:37:46,656 INFO [sqswatcher:_parse_sqs_messages] Processing COMPUTE_READY event for instance i-04a2ad48b47b10743
2019-06-17 16:37:46,656 INFO [sqswatcher:_parse_sqs_messages] Processing COMPUTE_READY event for instance i-0d628426c9b5a3d2b
2019-06-17 16:37:46,656 INFO [sge_commands:exec_qconf_command] Executing operation 'add administrative hosts' for hosts ip-172-31-23-156,ip-172-31-23-65
2019-06-17 16:37:46,672 INFO [sge_commands:exec_qconf_command] Executing operation 'add submit hosts' for hosts ip-172-31-23-156,ip-172-31-23-65
2019-06-17 16:37:46,747 INFO [transport:_log] Connected (version 2.0, client OpenSSH_7.4)
2019-06-17 16:37:46,768 INFO [transport:_log] Connected (version 2.0, client OpenSSH_7.4)
2019-06-17 16:37:46,813 INFO [transport:_log] Authentication (publickey) successful!
2019-06-17 16:37:46,832 INFO [transport:_log] Authentication (publickey) successful!
2019-06-17 16:37:46,835 INFO [remote_command_executor:run_remote_command] Executing remote command command on centos@ip-172-31-23-156: sudo sh -c 'cd /opt/sge && /opt/sge/inst_sge -noremote -x -auto /opt/parallelcluster/templates/sge/sge_inst.conf'
2019-06-17 16:37:46,858 INFO [remote_command_executor:run_remote_command] Executing remote command command on centos@ip-172-31-23-65: sudo sh -c 'cd /opt/sge && /opt/sge/inst_sge -noremote -x -auto /opt/parallelcluster/templates/sge/sge_inst.conf'
2019-06-17 16:37:47,904 ERROR [remote_command_executor:run_remote_command] Command sudo sh -c 'cd /opt/sge && /opt/sge/inst_sge -noremote -x -auto /opt/parallelcluster/templates/sge/sge_inst.conf' failed with error:

and output:
Reading configuration from file /opt/parallelcluster/templates/sge/sge_inst.conf
Install log can be found in: /opt/sge/default/common/install_logs/execd_install_ip-172-31-23-156_2019-06-17_16:37:47.log
2019-06-17 16:37:47,905 ERROR [remote_command_executor:_pickable_run_remote_command] Failed when executing remote command on node ip-172-31-23-156 with error RemoteCommandResult(return_code=1, stdout=u'Reading configuration from file /opt/parallelcluster/templates/sge/sge_inst.conf\nInstall log can be found in: /opt/sge/default/common/install_logs/execd_install_ip-172-31-23-156_2019-06-17_16:37:47.log', stderr='')
2019-06-17 16:37:47,927 ERROR [remote_command_executor:run_remote_command] Command sudo sh -c 'cd /opt/sge && /opt/sge/inst_sge -noremote -x -auto /opt/parallelcluster/templates/sge/sge_inst.conf' failed with error:

and output:
Reading configuration from file /opt/parallelcluster/templates/sge/sge_inst.conf
Install log can be found in: /opt/sge/default/common/install_logs/execd_install_ip-172-31-23-65_2019-06-17_16:37:47.log
2019-06-17 16:37:47,927 ERROR [remote_command_executor:_pickable_run_remote_command] Failed when executing remote command on node ip-172-31-23-65 with error RemoteCommandResult(return_code=1, stdout=u'Reading configuration from file /opt/parallelcluster/templates/sge/sge_inst.conf\nInstall log can be found in: /opt/sge/default/common/install_logs/execd_install_ip-172-31-23-65_2019-06-17_16:37:47.log', stderr='')
2019-06-17 16:37:48,137 INFO [sge_commands:add_hosts_to_group] Adding to @allhosts group
2019-06-17 16:37:48,137 INFO [sge_commands:add_host_slots] Adding to all.q queue
2019-06-17 16:37:48,172 WARNING [sqswatcher:_requeue_message] Re-queuing failed event {"Message": "{"StatusCode":"Complete","Description":"Succesfully launched i-04a2ad48b47b10743","Event":"parallelcluster:COMPUTE_READY","EC2InstanceId":"i-04a2ad48b47b10743","Slots":"18","LocalHostname":"ip-172-31-23-156.ec2.internal"}", "Type": "Notification", "TTL": 1}
2019-06-17 16:37:48,210 WARNING [sqswatcher:_requeue_message] Re-queuing failed event {"Message": "{"StatusCode":"Complete","Description":"Succesfully launched i-0d628426c9b5a3d2b","Event":"parallelcluster:COMPUTE_READY","EC2InstanceId":"i-0d628426c9b5a3d2b","Slots":"18","LocalHostname":"ip-172-31-23-65.ec2.internal"}", "Type": "Notification", "TTL": 1}

@demartinofra
Contributor

Could you get the output of one of those execd_install logs?

If you haven't done it yet, I would recommend removing the unneeded packages from post_install and starting with a new cluster so that the master node is in a clean state.

@msherman13
Author

Here is the execd_install log; it looks like there's an issue here. I am going to remove the package installs from the bootstrap script and create the cluster from scratch to see if that fixes it.

[centos@ip-172-31-20-40 ~]$ cat /opt/sge/default/common/install_logs/execd_install_ip-172-31-23-65_2019-06-17_16:37:47.log

Your $SGE_ROOT directory: /opt/sge

Using cell: >default<

Specified cluster name >$SGE_CLUSTER_NAME=p6444< resulted in the following conflict!
Detected old RC scripts.
/etc/init.d/sgeexecd.p6444

Remove existing component(s) of cluster > p6444 < first!
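
If anyone else hits this, the installer's own message points at the remedy; something like the following on the affected node should clear the stale component (a sketch; the p6444 cluster name comes from the log above):

sudo /etc/init.d/sgeexecd.p6444 stop || true   # stop the leftover exec daemon, if it is running
sudo rm -f /etc/init.d/sgeexecd.p6444          # remove the old RC script the installer detected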

@msherman13
Author

So it seems like things are working for me now. The ERROR is still present when a compute node starts, but the next attempt on the same node is successful and jobs are submitted. Maybe some kind of race condition?

@demartinofra
Contributor

That is normal: when the node starts, it is not yet attached to the scheduler; it has to wait for the sqswatcher to process the COMPUTE_READY message. We should probably be clearer in the message we write to the logs.

I'm going to resolve this issue but feel free to reopen if needed.

@tgjohnst

Just wanted to post a followup in case anyone else is having this issue. My group was hitting similar errors on our cluster (sqswatcher/jobwatcher failing with 'cannot import name util', etc.) after our recent upgrade. We use packer+ansible to install packages and modify the base ParallelCluster AMI before spinning up the cluster. I started cleaning out the cruft in our AMI-building and post-install routines, and eventually resolved the issue by removing ipa-client from our apt install list (it apparently installs a version of gssapi as a dependency, which breaks paramiko, which in turn breaks the three ParallelCluster node daemons). Hopefully this workaround won't be needed long-term; it sounds like a fix for at least the gssapi side of the issue is coming soon in paramiko (paramiko/paramiko#1311). A quick way to check for this particular conflict is sketched below.
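
pip freeze | grep -i gssapi   # shows whether a gssapi package was pulled in (e.g. by ipa-client)
python -c "import paramiko"   # expected to fail with "cannot import name util" while the conflicting gssapi is installed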
