
Feat(eos_validate_state): Add ANTA integration to eos_validate_state role #3171

Merged: 42 commits merged on Oct 2, 2023.


@gmuloc (Contributor) commented Sep 26, 2023

Change Summary

Integration of the ANTA Python framework into the eos_validate_state role.

Tasks List

  • (hardware) Validate environment (power supplies status).
  • (hardware) Validate environment (fan status).
  • (hardware) Validate environment (temperature).
  • (hardware) Validate transceivers manufacturer.
  • (NTP) Validate NTP status.
  • (interface_state) Validate Ethernet interfaces admin and operational status.
  • (interface_state) Validate Port-Channel interfaces admin and operational status.
  • (interface_state) Validate VLAN interfaces admin and operational status.
  • (interface_state) Validate VXLAN interfaces admin and operational status.
  • (interface_state) Validate Loopback interfaces admin and operational status.
  • (lldp_topology_fqdn) Validate LLDP topology when there is a domain name configured.
  • (lldp_topology_no_fqdn) Validate LLDP topology when there is no domain name configured.
  • (MLAG) Validate MLAG status.
  • (ip_reachability) Validate IP reachability (on directly connected interfaces).
  • (loopback_reachability) Validate loopback reachability (between devices).
  • (bgp_check) Validate ArBGP is configured and operating.
  • (bgp_check) Validate IP BGP and BGP EVPN sessions state.
  • (reload_cause) Validate last reload cause. (Optional)
  • (routing_table) Validate remote Lo0 addresses and remote Lo1 addresses are in the routing table (based on devices type).
  • schema -> probably not needed as part of this PR, unless we want to add the 3 accepted_ values (to be discussed later)
  • make eos_validate_state agnostic of eos_designs -> by adding knobs in structured_config where required (eos_validate_state should not rely on eos_designs variable type #1047) - will be addressed after this PR
  • discuss strict mode behavior (vs loose mode) - not implemented for ANTA mode
  • Usage of display (Ansible future-proofing)? Do we need all the existing messages, or are some present only because "skipped" messages cannot be removed in Ansible? -> make it a logger
  • Release tag for ANTA - with CACHE
  • Use the check flag in Ansible
  • module documentation -> DONE, needs review - the module is considered a preview
  • eos_validate_state preview validation - make the "feature" a preview for the first release
    • add a warning stating that loose mode is ignored
    • add all input variables as documented in the current overview document
  • test the current eos_validate_state and the ANTA-based eos_validate_state and make sure there is no change - document the changes for the tests
  • Use the anta variable instead of a tag for running eos_validate_state.
  • Fixes Feat(eos_validate_state): Detailed report for tests to be executed. #3138

Postponed Tasks List

Component(s) name

arista.avd.eos_validate_state

Proposed changes

  • eos_validate_state_runner.py is an action plugin that runs the ANTA framework. It calls the get_anta_results function to retrieve the test results.
  • get_anta_results.py is responsible for creating the ANTA test catalog and the other objects required to run ANTA.
  • Connections to the devices are made with the regular Ansible httpapi connection plugin.
  • Results from ANTA are returned as JSON and registered as the anta_results variable in Ansible.
  • ANTA-related tasks have been added to eos_validate_state tasks/main.yml and tagged anta so that the legacy eos_validate_state keeps working.
  • A new task has been added to tasks/reports.yml to generate the ANTA results report, leveraging the yaml_templates_to_facts plugin and the ANTA-specific template/generate_anta_report_results.j2 template while keeping the same report format.
  • The following Python classes have been created to replace the existing eos_validate_state tests:
__all__ = [
    "AvdTestP2PIPReachability",
    "AvdTestLoopback0Reachability",
    "AvdTestLLDPTopology",
    "AvdTestInbandReachability",
    "AvdTestHardware",
    "AvdTestMLAG",
    "AvdTestNTP",
    "AvdTestReloadCause",
    "AvdTestInterfacesState",
    "AvdTestRoutingTable",
    "AvdTestBGP",
]
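To make the catalog-building step handled by get_anta_results more concrete, here is a minimal sketch of how per-feature classes could each contribute ANTA test entries for one device. It is not the PR's code: AvdTestSketchBase, NTPSketch, MLAGSketch, test_definition and build_catalog are hypothetical names, the structured-config keys (ntp, mlag_configuration) are assumed, and the ANTA test names are used for illustration only.

```python
from __future__ import annotations

from typing import Any


class AvdTestSketchBase:
    """Hypothetical base: turns one device's structured config into ANTA test entries."""

    def __init__(self, structured_config: dict[str, Any]) -> None:
        self.structured_config = structured_config

    @property
    def test_definition(self) -> list[dict[str, Any]]:
        """Return [{anta_test_name: inputs}, ...]; an empty list means 'not applicable'."""
        raise NotImplementedError


class NTPSketch(AvdTestSketchBase):
    @property
    def test_definition(self) -> list[dict[str, Any]]:
        # Only generate an NTP check when NTP is configured on this device (assumed key).
        if not self.structured_config.get("ntp"):
            return []
        return [{"VerifyNTP": {}}]


class MLAGSketch(AvdTestSketchBase):
    @property
    def test_definition(self) -> list[dict[str, Any]]:
        # Only generate MLAG checks on devices that are part of an MLAG pair (assumed key).
        if not self.structured_config.get("mlag_configuration"):
            return []
        return [{"VerifyMlagStatus": {}}, {"VerifyMlagInterfaces": {}}]


def build_catalog(
    structured_config: dict[str, Any],
    test_classes: list[type[AvdTestSketchBase]],
) -> list[dict[str, Any]]:
    """Concatenate the entries produced by every test class for one device."""
    catalog: list[dict[str, Any]] = []
    for test_class in test_classes:
        catalog.extend(test_class(structured_config).test_definition)
    return catalog
```

Returning an empty list from non-applicable classes keeps each device's catalog limited to the features that device actually has.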

How to test

  • Install the main branch of anta:
    pip install git+https://github.com/arista-netdevops-community/anta.git
  • eos_validate_state can be used as usual, with --tags anta if you want to run the role with ANTA:
    ansible-playbook playbooks/fabric-validate.yaml --tags anta
    Without the anta tag, the regular (legacy) eos_validate_state runs.
  • Legacy Ansible tags are also supported if you want to run/skip tests:
    ansible-playbook playbooks/fabric-validate.yaml --tags anta,routing_table

The skipped_tests variable can now be used to skip tests; tests that are not listed still run. Listing a test class with no value skips the whole class:

skipped_tests:
    AvdTestHardware:

For more granularity, you can also skip specific subtests (individual ANTA tests) of a class:

skipped_tests:
    AvdTestHardware:
        - VerifyTransceiversTemperature

Checklist

User Checklist

  • N/A

Repository Checklist

  • My code has been rebased from devel before I started.
  • I have read the CONTRIBUTING document.
  • My change requires a change to the documentation, and the documentation has been updated accordingly.
  • I have updated molecule CI testing accordingly. (check the box if not applicable)

@gmuloc requested review from a team as code owners on September 26, 2023 14:34.
@github-actions bot added the labels type: documentation, type: code quality, role: eos_validate_state and state: CI Updated on Sep 26, 2023.
@gmuloc (Contributor, Author) commented Sep 26, 2023

Remove anta_ from the module parameters

@github-actions bot added the role: build_output_folders label on Sep 29, 2023.
@chetryan (Contributor) commented:

In my tests, when I set dc2-leaf1c and dc2-leaf2c (the L2 leaves) as not deployed and set shutdown_interfaces_towards_undeployed_peers: True, I noticed:

  • the interface checks on interfaces towards the L2 leafs were set to check for 'down' status, as expected
  • the LLDP neighborship check on the L3 leafs still expected to find the L2 leafs downstream. I think this is a bug.
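A sketch of one possible fix for the LLDP inconsistency reported here: when interfaces towards undeployed peers are shut, those links should also be excluded from the expected LLDP neighbors, mirroring what the interface check already does. Link, expected_lldp_neighbors and the deployed_devices set are hypothetical; the boolean flag stands in for shutdown_interfaces_towards_undeployed_peers.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One point-to-point link as seen from the local device (hypothetical shape)."""

    local_interface: str
    peer: str
    peer_interface: str


def expected_lldp_neighbors(
    links: list[Link],
    deployed_devices: set[str],
    shutdown_towards_undeployed: bool,
) -> list[Link]:
    """Return the links on which an LLDP neighbor should be verified."""
    expected = []
    for link in links:
        if shutdown_towards_undeployed and link.peer not in deployed_devices:
            # The local interface is shut, so no LLDP neighbor can be seen on it;
            # this matches the interface check, which already expects 'down'.
            continue
        expected.append(link)
    return expected
```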

@chetryan (Contributor) left a review comment:


Some changes to the documentation.

@chetryan (Contributor) commented:

When an interface is shut down, I can still see VerifyReachability tests created for the peer IP address on that interface.
This is existing behavior in the current eos_validate_state that has caused confusion and complaints about false positives before.

@chetryan (Contributor) commented:

Similarly, when an interface is shut down, the BGP peer check VerifyBGPSpecificPeers is still created for the peer on that interface. Is this in scope for this PR? Without it, we would still get false positives when some devices are not yet deployed.
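Both reports (VerifyReachability and VerifyBGPSpecificPeers generated toward shut-down interfaces) could be addressed by checking the interface state before generating the tests. This is a minimal sketch only: peer_tests and the interface dict keys name, peer_ip and shutdown are hypothetical, it assumes every addressed link also carries a BGP session (not true of every design), and it returns plain tuples rather than real ANTA test inputs.

```python
from __future__ import annotations


def peer_tests(ethernet_interfaces: list[dict]) -> list[tuple[str, str, str]]:
    """Return (test_name, interface, peer_ip) tuples, skipping shut-down interfaces.

    The 'name', 'peer_ip' and 'shutdown' keys are a hypothetical input shape.
    """
    tests = []
    for intf in ethernet_interfaces:
        if intf.get("shutdown", False):
            # Administratively down: a ping or BGP session check toward this
            # peer can only fail, which is the false positive reported here.
            continue
        peer_ip = intf.get("peer_ip")
        if peer_ip is None:
            continue  # no addressed peer on this interface
        tests.append(("VerifyReachability", intf["name"], peer_ip))
        tests.append(("VerifyBGPSpecificPeers", intf["name"], peer_ip))
    return tests
```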

@chetryan (Contributor) commented:

When the devices are not present, I see that eos_validate_state is successful, even though these devices are not reachable. I would expect the task Run eos_validate_state_runner leveraging ANTA to fail for unreachable devices.

@carl-baillargeon (Contributor) commented:

VerifyRoutingProtocolModel is being run on devices that have no BGP configuration, which does not match the current behavior of eos_validate_state.
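A minimal guard that would restore the legacy behavior, assuming BGP presence is detected through the router_bgp key of the device's structured config and that the test takes a model input set to multi-agent (both are assumptions, not confirmed by this thread). routing_protocol_model_tests is a hypothetical helper.

```python
from __future__ import annotations

from typing import Any


def routing_protocol_model_tests(structured_config: dict[str, Any]) -> list[dict[str, Any]]:
    """Generate the ArBGP (multi-agent model) check only on devices with BGP configured."""
    if not structured_config.get("router_bgp"):
        # No BGP on this device (e.g. an L2 leaf): skip the check entirely.
        return []
    # ArBGP corresponds to the EOS multi-agent routing protocol model.
    return [{"VerifyRoutingProtocolModel": {"model": "multi-agent"}}]
```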

@carl-baillargeon (Contributor) commented:

> When the devices are not present, I see that eos_validate_state is successful, even though these devices are not reachable. I would expect the task Run eos_validate_state_runner leveraging ANTA to fail for unreachable devices.

@chetryan Fixed in dd1c8c0

The task should fail and stop the play if the device is unreachable or there is a connection problem (e.g. wrong password or SSL problem).

Also, if for some reason we lose connectivity to the device during the tests, you should see a log with the failed commands and the tests associated with the failed commands will be marked as FAIL in the report.

It would be great if you could confirm these behaviors on your end.

Thanks
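A sketch of the mid-run failure behavior described in this reply: commands that could not be collected are logged, and every test depending on them is marked FAIL instead of being evaluated. collect_results, the command-to-output mapping (None for a failed command) and the evaluate callback are hypothetical; they illustrate the idea and are not ANTA's or the plugin's code.

```python
from __future__ import annotations

import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def collect_results(
    test_commands: dict[str, list[str]],
    command_outputs: dict[str, Any],
    evaluate: Callable[[str, list[Any]], tuple[str, str]],
) -> dict[str, tuple[str, str]]:
    """Mark tests FAIL when any of their commands failed; evaluate the rest.

    command_outputs maps each command to its output, or to None when the
    command could not be collected (e.g. connectivity lost mid-run).
    """
    results = {}
    for test_name, commands in test_commands.items():
        failed = [cmd for cmd in commands if command_outputs.get(cmd) is None]
        if failed:
            # Log the failed commands, then fail the dependent test.
            logger.warning("%s: failed commands: %s", test_name, ", ".join(failed))
            results[test_name] = ("FAIL", "Failed commands: " + ", ".join(failed))
        else:
            results[test_name] = evaluate(test_name, [command_outputs[cmd] for cmd in commands])
    return results
```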

@carlbuchmann (Member) left a review comment:


LGTM, great job @carl-baillargeon @gmuloc

@gmuloc (Contributor, Author) commented Oct 2, 2023

Follow up to this PR is tracked in avd-internal repo: https://github.com/aristanetworks/avd-internal/issues/119

@ClausHolbechArista merged commit d390258 into aristanetworks:devel on Oct 2, 2023.
47 checks passed
Labels: EPIC - eos_validate_state ANTA, rn: Feat(eos_validate_state), role: build_output_folders, role: eos_validate_state, state: CI Updated, state: Documentation role Updated, type: code quality, type: documentation
Linked issue (may be closed by merging this pull request): Feat(eos_validate_state): Detailed report for tests to be executed.
5 participants