
Introduce a cluster option for workshops #891

Merged: 21 commits into ansible:devel, Aug 19, 2020

Conversation

termlen0
Contributor

SUMMARY

This PR introduces the option of running any of the workshops with Tower as a cluster.

  • Users will need to add the create_cluster: yes option to their vars file for this to work (see the sketch below).
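
For reference, a minimal vars-file sketch; create_cluster is the only new option, and the other keys are illustrative examples of existing settings:

# sample vars file (illustrative sketch, not a complete config)
workshop_type: rhel      # existing option, shown for context
ec2_region: us-east-1    # existing option, shown for context
create_cluster: yes      # new: provision Tower as a multi-node cluster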
ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • provisioner
ADDITIONAL INFORMATION

This option will allow SAs to build even better demos around scaling/load balancing (RBAC).
Additionally, when provisioned for workshops, it will help highlight Tower's scaling/clustering features to the customer.

CC: @IPvSean


@cigamit
Contributor

cigamit commented May 27, 2020

I made a few comments inline just from things I saw. I like this option, but there are a few things I would fix. I will run a few tests tonight to see what else I notice. Sean isn't a fan of loops for building VMs, so we may have to think outside the box on how to create them all at once while still tagging each one for which node it is.

@cloin
Contributor

cloin commented Jun 2, 2020

Fails in the security workshop deploy and in rhel-verify; I think that's fine:

[2020-05-29T13:36:37.658Z] TASK [manage_ec2_instances : provision workshop instances] *********************
[2020-05-29T13:36:37.658Z] ERROR! Unexpected Exception, this is probably a bug: 'NoneType' object has no attribute 'rfind'
[2020-05-29T13:36:37.658Z] to see the full traceback, use -vvv
[2020-05-29T14:06:09.263Z] TASK [Test access by exporting assets] *****************************************
[2020-05-29T14:06:09.263Z] skipping: [student2-node1]
[2020-05-29T14:06:09.263Z] skipping: [student2-node2]
[2020-05-29T14:06:09.263Z] skipping: [student2-node3]
[2020-05-29T14:06:09.531Z] skipping: [student1-node1]
[2020-05-29T14:06:09.531Z] skipping: [student1-node3]
[2020-05-29T14:06:09.531Z] skipping: [student1-node2]
[2020-05-29T14:06:10.468Z] fatal: [student2-ansible-1]: FAILED! => changed=false 
[2020-05-29T14:06:10.468Z]   assets: null
[2020-05-29T14:06:10.468Z]   message: |-
[2020-05-29T14:06:10.468Z]     There was a network error of some kind trying to connect to Tower.
[2020-05-29T14:06:10.468Z]   
[2020-05-29T14:06:10.468Z]     The most common  reason for this is a settings issue; is your "host" value in `tower-cli config` correct?
[2020-05-29T14:06:10.468Z]     Right now it is: "student2-1.tqe-rhel-tower370-PR-891-3.rhdemo.io".
[2020-05-29T14:06:10.468Z]   msg: Receive Failed
[2020-05-29T14:06:10.468Z] fatal: [student1-ansible-1]: FAILED! => changed=false 
[2020-05-29T14:06:10.468Z]   assets: null
[2020-05-29T14:06:10.468Z]   message: |-
[2020-05-29T14:06:10.468Z]     There was a network error of some kind trying to connect to Tower.
[2020-05-29T14:06:10.468Z]   
[2020-05-29T14:06:10.468Z]     The most common  reason for this is a settings issue; is your "host" value in `tower-cli config` correct?
[2020-05-29T14:06:10.468Z]     Right now it is: "student1-1.tqe-rhel-tower370-PR-891-3.rhdemo.io".
[2020-05-29T14:06:10.468Z]   msg: Receive Failed

@Spredzy
Collaborator

Spredzy commented Jun 2, 2020

recheck

@cloin
Contributor

cloin commented Jun 5, 2020

Recheck

@termlen0
Contributor Author

termlen0 commented Jun 5, 2020

@cloin / @Spredzy Let me know what I can do to help move this PR forward.

@cloin
Contributor

cloin commented Jun 8, 2020

The check failures don't seem to be related to the changes in this PR. @liquidat, how do you feel about this PR? Can you please review?

Contributor

@liquidat left a comment


Please check if an update to Tower 3.7 is possible. It seems a missed opportunity if we do not start with 3.7 now and instead have to do the update to 3.7 later.

@liquidat
Contributor

liquidat commented Jun 8, 2020

@goetzrieger What do you think? Maybe we can start the advanced Tower lab right off of this PR? It could simplify a lot of work!

@termlen0
Contributor Author

termlen0 commented Jun 8, 2020 via email

@termlen0
Contributor Author

termlen0 commented Jun 8, 2020 via email

@goetzrieger
Contributor

First, for @termlen0: this is something we really need. @cbolz and I built something like this for Summit to run our Advanced Tower lab on (the advanced_tower branch). But we didn't find the time to get it into devel proper (not to mention master), so it would be a lot of work now (there were a lot of changes in the meantime...). But we need a clustered env in master to get the lab into RHPDS...

I've only had time for a brief look at what you've done, but I will try to test it; maybe @cbolz can have a look, too.

It would be great to get this working and into devel/master. The Advanced Tower lab itself doesn't have many requirements for the lab environment beyond a three-node cluster, but we need to check.

@liquidat
Contributor

liquidat commented Jun 9, 2020

> As for Tower 3.7, these rules will allow workshops to be spun up with older versions. As an SA sometimes we might use this provisioner to simulate customer environments.

Maybe, but it would be news to me that this is a desired feature of the workshops?!

And while it does not affect the actual deployment, it introduces an entire set of legacy code: RabbitMQ configuration in the inventory, various firewall rules, etc.

That is not a deal breaker for me, but I would much prefer to focus on up-to-date Tower releases. We also no longer cater to people who want to use RHEL 7, or other older Tower or Ansible releases.

@goetzrieger
Contributor

goetzrieger commented Jun 10, 2020

I did some tests today with the RHEL workshop:

  • The EPEL $releasever fix (#892) is missing in your fork, so it failed in Install EPEL
  • The workbench info on the landing page has (the same) entries for all 4 control nodes... :)
  • Apart from that it looks good so far, but I haven't tested anything beyond deploy yet

@termlen0
Contributor Author

termlen0 commented Jun 11, 2020 via email

@goetzrieger
Contributor

goetzrieger commented Jun 11, 2020

Some more thoughts:

  • To avoid creating instances in a loop the most pragmatic way might be to just include a cluster_instances.yml or a single_instance.yml (or so) file in manage_ec2_instances
  • As far as I can see you still have all control nodes in the control_nodes group and then you separate tasks by running them either on ansible-1 or ansible-2 to ansible-4 (instead of using groups) in provision_lab.yml. IMHO this is asking for trouble, the group control_nodes could accidentally be used somewhere else, renaming nodes is hard etc. Having a dedicated group for the cluster nodes might be cleaner. Disclaimer: we did this for the Summit env and it gave us loads of fun/pain.
  • This is more cosmetic, but I would prefer names like tower-1, tower-2 so students don't get confused (too easily ;).

I know it's hard to get clustering into the Workshops as unintrusively as possible. I'm happy to help if I can.

@termlen0
Contributor Author

Per @goetzrieger's review, I've rebased to accommodate PR #892. I've also updated the landing page J2 template to display only the main control node's details. Tested with the RHEL workshop.

@termlen0
Contributor Author

> Some more thoughts:
>
> * To avoid creating instances in a loop the most pragmatic way might be to just include a cluster_instances.yml or a single_instance.yml (or so) file in manage_ec2_instances

This is what is currently being done. See:

- name: Create the control clusters
  include_tasks: cluster_instances.yml
  loop: "{{ range(1, control_nodes|default(1) + 1 ) | list }}"
  loop_control:
    loop_var: sequence
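
For illustration, with control_nodes set to 4 the loop expression expands to [1, 2, 3, 4], so cluster_instances.yml is included once per control node with sequence as the node index. A hypothetical one-off task (not part of the PR) to see the list:

- name: Show the cluster loop sequence (illustrative only)
  debug:
    msg: "{{ range(1, control_nodes|default(1) + 1) | list }}"  # [1, 2, 3, 4] when control_nodes == 4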

> IMHO this is asking for trouble, the group control_nodes could accidentally be used somewhere else, renaming nodes is hard etc. Having a dedicated group for the cluster nodes might be cleaner. Disclaimer: we did this for the Summit env and it gave us loads of fun/pain.

While I agree that it might be a cleaner approach, I fail to see how using the control group for anything else would cause an issue. In the spirit of getting a cluster option in place, I suggest we table this for a future PR.

> * This is more cosmetic, but I would prefer names like tower-1, tower-2 so students don't get confused (too easily ;).

I'm personally not opinionated one way or the other about the naming; I'll let others comment on it. But again, not a show-stopper for this PR, IMO.

@cbolz

cbolz commented Jun 19, 2020

I like the approach - it's less intrusive than what we hacked together for Summit.

Personally, I would probably split out changes like switching from private to public IPs and upgrading to 3.7 into separate PRs, but I guess that's up for debate and just a different way of working.

I also agree with @liquidat that the purpose of the workshop is to have the latest Ansible releases and not to carry technical debt to support all sorts of old releases. IMHO that's out of scope for this project.

@termlen0 requested a review from @liquidat on July 7, 2020, 21:11
@goetzrieger
Contributor

goetzrieger commented Jul 15, 2020

@cloin @liquidat ?

We're just going through the pain of using our old cluster provisioner for Summit Open House. It'd be great to have this setup going before the next event... :)

@liquidat
Contributor

@termlen0 This is shaping up really nicely, I love it! One small thing missing: we need a sample vars file, or at least an entry, for each new option we bring in; can you add this?

Also, while I would say @goetzrieger is a bit too cautious with the thoughts around control_nodes, we should at least make sure that the other labs are working with it. So we need to fix these lines where stuff is installed on control_nodes:

* https://github.com/ansible/workshops/blob/2ff691f0471a0f3f2d9a4753dc395510b9e98471/provisioner/windows.yml#L17

* https://github.com/ansible/workshops/blob/2ff691f0471a0f3f2d9a4753dc395510b9e98471/provisioner/roles/workshop_attendance/templates/workshop.sql.j2#L19

* https://github.com/ansible/workshops/blob/2ff691f0471a0f3f2d9a4753dc395510b9e98471/provisioner/security.yml#L111

Numbers one and three can just be rewritten to ansible-1, but I am not sure about number two; maybe @cloin can help?

@goetzrieger
Contributor

Ajay, I know this is a bit selfish, but while you are at it:

Could we have four managed nodes, ideally node1/node2, then isonode and remotenode?

Then this environment would line up with the Advanced Tower lab [1] perfectly. If this is asking too much, I'll give it a shot later.

[1] https://people.redhat.com/grieger/summit2020_labs/ansible-tower-advanced/8-isolated-nodes/

@termlen0
Contributor Author

> Ajay, I know this is a bit selfish, but while you are at it:
>
> Could we have four managed nodes, ideally node1/node2, then isonode and remotenode?
>
> Then this environment would line up with the Advanced Tower lab [1] perfectly. If this is asking too much, I'll give it a shot later.
>
> [1] https://people.redhat.com/grieger/summit2020_labs/ansible-tower-advanced/8-isolated-nodes/

I'll try. For now, the path of least resistance for me to get this PR into devel is to address @liquidat's 3 items. I'll get those in first and, once the PR is merged into devel, start a new feature branch to refactor per your suggestion. Hope that works. :)
Cheers.

@termlen0
Contributor Author

> @termlen0 This is shaping up really nicely, I love it! One small thing missing: we need a sample vars file, or at least an entry, for each new option we bring in; can you add this?
>
> Also, while I would say @goetzrieger is a bit too cautious with the thoughts around control_nodes, we should at least make sure that the other labs are working with it. So we need to fix these lines where stuff is installed on control_nodes:
>
> * https://github.com/ansible/workshops/blob/2ff691f0471a0f3f2d9a4753dc395510b9e98471/provisioner/windows.yml#L17
>
> * https://github.com/ansible/workshops/blob/2ff691f0471a0f3f2d9a4753dc395510b9e98471/provisioner/roles/workshop_attendance/templates/workshop.sql.j2#L19
>
> * https://github.com/ansible/workshops/blob/2ff691f0471a0f3f2d9a4753dc395510b9e98471/provisioner/security.yml#L111
>
> Numbers one and three can just be rewritten to ansible-1, but I am not sure about number two; maybe @cloin can help?

I've made changes to address all 3 items and tested against the windows workshop. As for the sample vars, I'll update all existing sample vars files with the create_cluster boolean and set it to "no" by default.

@liquidat
Contributor

liquidat commented Aug 5, 2020

recheck

@liquidat
Contributor

recheck

@liquidat
Contributor

We have builds failing.

First, RHEL verify is failing:

[2020-08-07T18:08:43.603Z] TASK [Test access by exporting assets] *****************************************
[2020-08-07T18:08:43.882Z] skipping: [student1-node1]
[2020-08-07T18:08:43.882Z] skipping: [student1-node2]
[2020-08-07T18:08:43.882Z] skipping: [student1-node3]
[2020-08-07T18:08:43.882Z] skipping: [student2-node1]
[2020-08-07T18:08:43.882Z] skipping: [student2-node2]
[2020-08-07T18:08:44.142Z] skipping: [student2-node3]
[2020-08-07T18:08:45.510Z] FAILED - RETRYING: Test access by exporting assets (60 retries left).
[...]
[2020-08-07T18:12:40.901Z] FAILED - RETRYING: Test access by exporting assets (1 retries left).
[2020-08-07T18:12:44.177Z] fatal: [student2-ansible-1]: FAILED! => changed=false 
[2020-08-07T18:12:44.177Z]   assets: null
[2020-08-07T18:12:44.177Z]   attempts: 60
[2020-08-07T18:12:44.177Z]   message: |-
[2020-08-07T18:12:44.177Z]     There was a network error of some kind trying to connect to Tower.
[2020-08-07T18:12:44.177Z]   
[2020-08-07T18:12:44.177Z]     The most common  reason for this is a settings issue; is your "host" value in `tower-cli config` correct?
[2020-08-07T18:12:44.177Z]     Right now it is: "student2-1.tqe-rhel-tower371-PR-891-32.rhdemo.io".
[2020-08-07T18:12:44.177Z]   msg: Receive Failed
[2020-08-07T18:12:44.433Z] fatal: [student1-ansible-1]: FAILED! => changed=false 
[2020-08-07T18:12:44.433Z]   assets: null
[2020-08-07T18:12:44.433Z]   attempts: 60
[2020-08-07T18:12:44.433Z]   message: |-
[2020-08-07T18:12:44.433Z]     There was a network error of some kind trying to connect to Tower.
[2020-08-07T18:12:44.433Z]   
[2020-08-07T18:12:44.433Z]     The most common  reason for this is a settings issue; is your "host" value in `tower-cli config` correct?
[2020-08-07T18:12:44.433Z]     Right now it is: "student1-1.tqe-rhel-tower371-PR-891-32.rhdemo.io".
[2020-08-07T18:12:44.433Z]   msg: Receive Failed

I think we need to modify this line in the testing script:

tower_host: "{{ inventory_hostname|regex_replace('-ansible', '') }}.{{ workshop_name }}.rhdemo.io"

Second, security deployment fails:

[2020-08-07T17:38:26.714Z] TASK [manage_ec2_instances : provision workshop instances] *********************
[2020-08-07T17:38:26.714Z] ERROR! Unexpected Exception, this is probably a bug: expected str, bytes or os.PathLike object, not NoneType

This would be this line:

include_tasks: 'instances/instances_{{ workshop_type }}.yml'

Honestly, I have no idea what is going on. I will try to provision on my own from this branch and see if I can replicate the problem.

@goetzrieger
Contributor

goetzrieger commented Aug 11, 2020

Easiest fix for the RHEL verify fail, IMO (if we want to stay with Ajay's naming convention):

file: provisioner/tests/rhel_verify.yml

tower_host: "{{ inventory_hostname|regex_replace('-ansible-1', '') }}.{{ workshop_name }}.rhdemo.io"
[...]
when: '"ansible-1" in inventory_hostname'

Tested with cluster and non-cluster RHEL WS.

@termlen0

@liquidat
Contributor

@termlen0 Can you please include @goetzrieger's patch and also rebase? After the rebase I can track down the new bug; right now, without a rebase, it is rather hard.

@termlen0
Contributor Author

termlen0 commented Aug 17, 2020 via email

@liquidat
Contributor

Build still fails here, this time with more information:

TASK [manage_ec2_instances : provision workshop instances] ***********************************************************************************************************************************
task path: /home/rwolters/gits/github/termlen0-linklight/provisioner/roles/manage_ec2_instances/tasks/provision.yml:47
ERROR! Unexpected Exception, this is probably a bug: expected str, bytes or os.PathLike object, not NoneType
the full traceback was:

Traceback (most recent call last):
  File "/home/rwolters/development/venv_ansible_2.9/bin/ansible-playbook", line 123, in <module>
    exit_code = cli.run()
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/cli/playbook.py", line 127, in run
    results = pbex.run()
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/executor/playbook_executor.py", line 169, in run
    result = self._tqm.run(play=play)
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/executor/task_queue_manager.py", line 241, in run
    play_return = strategy.run(iterator, play_context)
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/plugins/strategy/linear.py", line 359, in run
    new_blocks = self._load_included_file(included_file, iterator=iterator)
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/plugins/strategy/__init__.py", line 890, in _load_included_file
    block_list = load_list_of_blocks(
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/playbook/helpers.py", line 70, in load_list_of_blocks
    Block.load(
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/playbook/block.py", line 94, in load
    return b.load_data(data, variable_manager=variable_manager, loader=loader)
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/playbook/base.py", line 235, in load_data
    self._attributes[target_name] = method(name, ds[name])
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/playbook/block.py", line 122, in _load_block
    return load_list_of_tasks(
  File "/home/rwolters/development/venv_ansible_2.9/lib64/python3.8/site-packages/ansible/playbook/helpers.py", line 191, in load_list_of_tasks
    parent_include_dir = os.path.dirname(templar.template(parent_include.args.get('_raw_params')))
  File "/usr/lib64/python3.8/posixpath.py", line 152, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

@liquidat
Contributor

So, this looks like an evil Ansible bug. For me, this change made it work:

diff --git a/provisioner/provision_lab.yml b/provisioner/provision_lab.yml
index 2f86777f..50487323 100644
--- a/provisioner/provision_lab.yml
+++ b/provisioner/provision_lab.yml
@@ -18,15 +18,14 @@
   connection: local
   become: false
   gather_facts: false
-  tasks:
+  pre_tasks:
     - name: Cluster nodes
       set_fact:
         control_nodes: 4
       when: create_cluster is defined and create_cluster|bool
 
-    - name: Manage EC2
-      include_role:
-        name: manage_ec2_instances
+  roles:
+    - manage_ec2_instances
 
 - name: wait for all nodes to have SSH reachability
   hosts: "managed_nodes:control_nodes:attendance"

@termlen0 You mentioned in chat that you don't see this behavior. I tested with Ansible 2.9.9 and 2.9.12; both times I see the same problem.
Is there any reason why we could not adopt the change mentioned above, besides the fact that it is ugly?

@liquidat
Contributor

@termlen0 Security deployment is mostly fine. The RHEL verify script still fails as mentioned above; @goetzrieger had a patch, did you include it?

There is an error with the security verify as well, but I'd like a recheck to be sure that this is not a fluke.

@termlen0
Contributor Author

> @termlen0 Security deployment is mostly fine. The RHEL verify script still fails as mentioned above; @goetzrieger had a patch, did you include it?
>
> There is an error with the security verify as well, but I'd like a recheck to be sure that this is not a fluke.

Just committed.

@liquidat
Contributor

@termlen0 We missed something in the security workshop: the checkpoint stuff isn't even called, and thus the test fails. Can you please add this patch?

diff --git a/provisioner/roles/cp_setup/tasks/main.yml b/provisioner/roles/cp_setup/tasks/main.yml
index 60b2105f..1a68a42a 100644
--- a/provisioner/roles/cp_setup/tasks/main.yml
+++ b/provisioner/roles/cp_setup/tasks/main.yml
@@ -1,7 +1,7 @@
 ---
 - name: login, get SID
   uri:
-    url: "https://{{ hostvars[inventory_hostname|regex_replace('ansible', 'checkpoint_mgmt')]['private_ip'] }}/web_api/login"
+    url: "https://{{ hostvars[inventory_hostname|regex_replace('ansible-1', 'checkpoint_mgmt')]['private_ip'] }}/web_api/login"
     method: POST
     body:
       user: admin
@@ -15,7 +15,7 @@
 
 - name: Add NGFW to MGMT
   uri:
-    url: "https://{{ hostvars[inventory_hostname|regex_replace('ansible', 'checkpoint_mgmt')]['private_ip'] }}/web_api/add-simple-gateway"
+    url: "https://{{ hostvars[inventory_hostname|regex_replace('ansible-1', 'checkpoint_mgmt')]['private_ip'] }}/web_api/add-simple-gateway"
     validate_certs: false
     method: POST
     headers:
@@ -24,7 +24,7 @@
     body_format: json
     body:
       name: myngfw
-      ip-address: "{{ hostvars[inventory_hostname|regex_replace('ansible', 'checkpoint_gw')]['private_ip'] }}"
+      ip-address: "{{ hostvars[inventory_hostname|regex_replace('ansible-1', 'checkpoint_gw')]['private_ip'] }}"
       one-time-password: admin123
       firewall: true
       version: R80.30
@@ -34,7 +34,7 @@
           anti-spoofing-settings:
             action: prevent
           name: "eth0"
-          ip-address: "{{ hostvars[inventory_hostname|regex_replace('ansible', 'checkpoint_gw')]['private_ip'] }}"
+          ip-address: "{{ hostvars[inventory_hostname|regex_replace('ansible-1', 'checkpoint_gw')]['private_ip'] }}"
           network-mask: "255.255.0.0"
           ipv4-mask-length: 16
           security-zone: false
@@ -45,7 +45,7 @@
           anti-spoofing-settings:
             action: prevent
           name: "eth1"
-          ip-address: "{{ hostvars[inventory_hostname|regex_replace('ansible', 'checkpoint_gw')]['private_ip2'] }}"
+          ip-address: "{{ hostvars[inventory_hostname|regex_replace('ansible-1', 'checkpoint_gw')]['private_ip2'] }}"
           network-mask: "255.255.0.0"
           ipv4-mask-length: 16
           security-zone: false
@@ -53,7 +53,7 @@
 
 - name: Publish
   uri:
-    url: "https://{{ hostvars[inventory_hostname|regex_replace('ansible', 'checkpoint_mgmt')]['private_ip'] }}/web_api/publish"
+    url: "https://{{ hostvars[inventory_hostname|regex_replace('ansible-1', 'checkpoint_mgmt')]['private_ip'] }}/web_api/publish"
     validate_certs: false
     method: POST
     headers:
@@ -67,7 +67,7 @@
   ec2_instance_info:
     region: "{{ ec2_region }}"
     filters:
-      "tag:Name": "{{ inventory_hostname|regex_replace('ansible', 'checkpoint_gw') }}"
+      "tag:Name": "{{ inventory_hostname|regex_replace('ansible-1', 'checkpoint_gw') }}"
       "instance-state-name": running
   register: gw_inst
   delegate_to: localhost
diff --git a/provisioner/security.yml b/provisioner/security.yml
index 53dcbc38..c3e4b25b 100644
--- a/provisioner/security.yml
+++ b/provisioner/security.yml
@@ -108,6 +108,6 @@
     - role: cp_fix_mgmt
 
 - name: SETUP CHECKPOINT ENVIRONMENT
-  hosts: ansible-1
+  hosts: '*ansible-1'
   roles:
     - role: cp_setup
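
The underlying issue, as far as I can tell: with the cluster naming the control hosts are now called e.g. student1-ansible-1 rather than student1-ansible, so regex_replace('ansible', 'checkpoint_mgmt') would produce non-existent names like student1-checkpoint_mgmt-1; replacing the full ansible-1 suffix presumably yields the actual student1-checkpoint_mgmt host. Likewise, hosts: ansible-1 only matches a host literally named ansible-1, while the '*ansible-1' glob matches student1-ansible-1 and friends.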

With this patch my tests are all good.

@liquidat merged commit 66a6f07 into ansible:devel on Aug 19, 2020