[BUG] v3 multi-threading fails due to user cache #44

Closed
mtuska opened this issue Mar 1, 2024 · 10 comments · Fixed by #45
@mtuska

mtuska commented Mar 1, 2024

EzSNMP release version OR commit number
1.0.0c

Operating System and Version

  • OS: Ubuntu
  • Version 22

Net-SNMP Library Version

  • 5.9.1

Describe the bug
When using multiple threads with SNMPv3, the user cache is cleared when one of the sessions ends, causing the other threads to fail.

To Reproduce
The pytest I've built to replicate this issue:

test_threading.py

from __future__ import unicode_literals

import platform
import re

import pytest
from ezsnmp.exceptions import (
    EzSNMPError,
    EzSNMPConnectionError,
    EzSNMPTimeoutError,
    EzSNMPNoSuchObjectError,
    EzSNMPNoSuchInstanceError,
    EzSNMPNoSuchNameError,
)

from ezsnmp.session import Session
import concurrent.futures

@pytest.mark.parametrize("workers", [1, 8, 16])
@pytest.mark.parametrize("jobs", [16])
def test_session_threaded(sess_args, workers, jobs):
    def do_work(sess_args):
        sess = Session(**sess_args)
        res = sess.get([("sysUpTime", "0"), ("sysContact", "0"), ("sysLocation", "0")])

        assert len(res) == 3

        assert res[0].oid == "sysUpTimeInstance"
        assert res[0].oid_index == ""
        assert int(res[0].value) > 0
        assert res[0].snmp_type == "TICKS"

        assert res[1].oid == "sysContact"
        assert res[1].oid_index == "0"
        assert res[1].value == "G. S. Marzot <gmarzot@marzot.net>"
        assert res[1].snmp_type == "OCTETSTR"

        assert res[2].oid == "sysLocation"
        assert res[2].oid_index == "0"
        assert res[2].value == "my original location"
        assert res[2].snmp_type == "OCTETSTR"
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(do_work, sess_args) for _ in range(jobs)]
        for future in futures:
            future.result()

With 1 worker it passes without issue, but with 8 or 16 workers it fails.
To remediate the issue, I commented out the call to "__remove_user_from_cache" on line 1611, which lets the test pass. Probably not the ideal fix, but it seems to confirm the issue is related to the user caching.
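
As a caller-side workaround sketch only (not a fix for the library itself; it assumes the same sess_args as above, and the helper names here are hypothetical), keeping one Session alive per worker thread avoids session teardown, and therefore the cache clear, while other threads still have requests in flight:

import threading

from ezsnmp.session import Session

_tls = threading.local()

def get_thread_session(sess_args):
    # Hypothetical helper: one long-lived Session per worker thread, so
    # session teardown (which clears the USM user cache) never runs while
    # other threads still have requests in flight.
    if getattr(_tls, "session", None) is None:
        _tls.session = Session(**sess_args)
    return _tls.session

def do_work(sess_args):
    sess = get_thread_session(sess_args)
    return sess.get([("sysUpTime", "0")])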

Expected behavior
SNMPv3 sessions work in multiple threads.

@mtuska mtuska added the bug Something isn't working label Mar 1, 2024
@carlkidcrypto
Owner

@mtuska Thanks for reporting this. I'll try to take a look at this and resolve it. Feel free to open up a PR if you find a solution before I do.

@carlkidcrypto carlkidcrypto self-assigned this Mar 2, 2024
@carlkidcrypto
Owner

carlkidcrypto commented Mar 3, 2024

@mtuska If you are comfortable installing code from source, give this branch a try: 44-bug-v3-multi-threading-fails-due-to-user-cache. If it still doesn't work for you, can you paste some of the Python stack trace errors?

@carlkidcrypto carlkidcrypto linked a pull request Mar 3, 2024 that will close this issue
@mtuska
Author

mtuska commented Mar 3, 2024

Wow, you were quick on this; I'll give it a run-through on Monday morning. Additionally, I've seen a strange issue with SNMPv3 usmStatsNotInTimeWindows when calling different hosts, but I'll open a new issue for that. I still need to find/build a reproduction of that issue.

@mtuska
Author

mtuska commented Mar 4, 2024

After some real-world tests (maybe we should add artificial sleep time to the threading test; see the sketch at the end of this comment), the branch appears to be single-threaded, as the responses come back one at a time from what I'm witnessing. This run queries 414 devices (some v2, some v3; the v3 credentials are shared) with 64 threads.

Time with removal of clearing cache in the main branch(main):
real 1m4.800s
user 0m36.031s
sys 0m4.493s

Time with suggested branch(44-bug-v3-multi-threading-fails-due-to-user-cache) until segmentation fault:
real 3m21.835s
user 0m0.305s
sys 0m0.036s

Thread 65 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff15ffb640 (LWP 1232404)]
__strcmp_evex () at ../sysdeps/x86_64/multiarch/strcmp-evex.S:139
139 ../sysdeps/x86_64/multiarch/strcmp-evex.S: No such file or directory.
(gdb) bt
#0 __strcmp_evex () at ../sysdeps/x86_64/multiarch/strcmp-evex.S:139
#1 0x00007ffff60ce5e4 in __remove_user_from_cache (ss=) at ezsnmp/interface.cpp:1389
#2 0x00007ffff60ce67b in delete_session_capsule (session_capsule=) at ezsnmp/interface.cpp:1621
#3 0x00005555557ba3dd in capsule_dealloc (o=<PyCapsule at remote 0x7ffff4773ab0>) at ../Objects/capsule.c:261
#4 0x000055555568a8b3 in _Py_Dealloc (op=) at ../Objects/object.c:2301
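
A minimal sketch of the artificial-sleep idea mentioned above (assuming the same sess_args fixture as the original test; the sleep range is arbitrary):

import concurrent.futures
import random
import time

import pytest

from ezsnmp.session import Session

@pytest.mark.parametrize("workers", [8, 16])
@pytest.mark.parametrize("jobs", [16])
def test_session_threaded_with_sleep(sess_args, workers, jobs):
    def do_work(sess_args):
        sess = Session(**sess_args)
        res = sess.get([("sysUpTime", "0")])
        # Hold the session open for a moment so teardown in one thread
        # overlaps with in-flight requests in the others.
        time.sleep(random.uniform(0.05, 0.25))
        assert len(res) == 1

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(do_work, sess_args) for _ in range(jobs)]
        for future in futures:
            future.result()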

@mtuska
Author

mtuska commented Mar 4, 2024

Oh, after a bit of investigating I recall I made another fix for this segmentation fault. The issue comes from running both v1/v2 and v3 in the same process, as both will call over to __remove_user_from_cache, where the strcmp calls can then hit NULL securityName/contextEngineID values (I think the test didn't display this issue, since it goes v1, v2c, then v3, whereas the real world is more random), so this would require additional checking. In my local edits of easysnmp I had added checks for valid actUser fields:

static void __remove_user_from_cache(struct session_list *ss)
{
    struct usmUser *actUser = usm_get_userList();
    struct usmUser *nextUser = NULL;
    while (actUser != NULL)
    {
        nextUser = actUser->next;
        if (
+            actUser->secName != NULL && ss->session->securityName != NULL &&
+            actUser->engineID != NULL && ss->session->contextEngineID != NULL &&
            strcmp((const char *)actUser->secName, (const char *)ss->session->securityName) == 0 &&
            strcmp((const char *)actUser->engineID, (const char *)ss->session->contextEngineID) == 0)
        {
            usm_remove_user(actUser);
            usm_free_user(actUser);
        }
        actUser = nextUser;
    }
}

I'm currently rerunning it to double-check that the fault is removed by this. It will take a while with my devices due to the single-threaded nature of the branch.

Edit:
During the re-run, I am seeing many "ezsnmp.exceptions.EzSNMPError: USM unknown security name (no such user exists)" errors for devices that were previously working without issue.

Final time for branch with the above edit to avoid the segmentation fault:
real 41m48.719s
user 0m25.336s
sys 0m2.665s

@carlkidcrypto
Owner

carlkidcrypto commented Mar 4, 2024

> Oh, after a bit of investigating I recall I made another fix for this segmentation fault. The issue comes from running both v1/v2 and v3 in the same process, as both will call over to __remove_user_from_cache (I think the test didn't display this issue, since it goes v1, v2c, then v3, whereas the real world is more random) [...] In my local edits of easysnmp I had added checks for valid actUser fields [...]

Ahh, you're right, it does go in order: v1, v2c, etc. Let me add a test case that's more "real world".

And I'll pull in your version of __remove_user_from_cache.
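
For illustration, a mixed-version "real world" threaded test along those lines might look roughly like this (a hedged sketch, not the exact test added to the repo; sess_v2_args and sess_v3_args are assumed fixtures providing v2c and v3 session arguments):

import concurrent.futures
import random

import pytest

from ezsnmp.session import Session

@pytest.mark.parametrize("workers", [8, 16])
@pytest.mark.parametrize("jobs", [32])
def test_session_threaded_mixed_versions(sess_v2_args, sess_v3_args, workers, jobs):
    """Interleave v2c and v3 sessions so session teardown for one version
    overlaps with in-flight requests of the other, as in real-world polling."""

    def do_work(sess_args):
        sess = Session(**sess_args)
        res = sess.get([("sysUpTime", "0"), ("sysLocation", "0")])
        assert len(res) == 2

    # Shuffle so v2c and v3 jobs are not grouped together.
    all_args = [sess_v2_args, sess_v3_args] * (jobs // 2)
    random.shuffle(all_args)

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(do_work, args) for args in all_args]
        for future in futures:
            future.result()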

Edit:

Tests have been added to try and mimic "real world" scenarios. The Linux tests all pass, but it looks like the MacOS tests are showing the "No Such User" exists error - https://github.com/carlkidcrypto/ezsnmp/actions/runs/8151441707/job/22279325545

Based on previous comments, it is suspected to be an SNMPD bug... hmm.

@carlkidcrypto
Owner

carlkidcrypto commented Mar 5, 2024

@mtuska, are you seeing the "No Such User" exists errors on certain devices using a certain SNMP version like v1, v2, etc.? Would you happen to know what SNMPD version the end devices are using?
Also, try once more with what's on the bugfix branch 👍🏽

@mtuska
Author

mtuska commented Mar 5, 2024

I'm still seeing it act single-threaded, and it appears both v2 and v3 are failing for the most part. For my sanity's sake I commented out the std::lock_guard and __remove_user_from_cache calls, and afterwards it's no longer single-threaded and calls are being made successfully.
The "USM unknown security name (no such user exists)" is failing for v3 clients.

@carlkidcrypto
Owner

> I'm still seeing it act single-threaded, and it appears both v2 and v3 are failing for the most part. For my sanity's sake I commented out the std::lock_guard and __remove_user_from_cache calls, and afterwards it's no longer single-threaded and calls are being made successfully.
> The "USM unknown security name (no such user exists)" is failing for v3 clients.

Thanks for the update. Are you able to attach the Python script you are using? I want to understand both the issue and use case. It'll take some time.

@mtuska
Author

mtuska commented Mar 6, 2024

I've been using my own wrapper to abstract the actual SNMP classes and build a MIB helper so I can easily transition between libraries, but I made a shorter version without it for this issue.
My use case for this current work is network devices, specifically Cisco and Juniper devices.

import threading
import mysql.connector
import ezsnmp
import concurrent.futures
import traceback
import re
import ezsnmp.exceptions
from typing import TYPE_CHECKING, Any, Dict, NoReturn, Optional, Tuple, Type, Union
from mysql.connector.pooling import PooledMySQLConnection
from mysql.connector.abstracts import MySQLConnectionAbstract

db_config = {
    'user': '',
    'password': '',
    'host': 'localhost',
    'database': '',
    'raise_on_warnings': True
}

_tls = threading.local()
def get_db() -> Union[PooledMySQLConnection, MySQLConnectionAbstract]:
    if getattr(_tls, 'db_conn', None) is None:
        _tls.db_conn = mysql.connector.connect(**db_config)
    return _tls.db_conn

def process_host(hostname, ip_address, vendor):
    try:
        print(hostname, ip_address, vendor)
        device = get_device(ip_address, vendor)
        device.get([".1.3.6.1.2.1.1.1", ".1.3.6.1.2.1.1.2"])
        # desc = device.get(snmp_util.mibs.snmpv2Mib.sysDescr).get()
        # objId = device.get(snmp_util.mibs.snmpv2Mib.sysObjectID).get()
    except ezsnmp.exceptions.EzSNMPTimeoutError as e:
        traceback.print_exception(e)
        print(f"Timed out while connecting to {hostname} with IP {ip_address}: {e}")
    except Exception as e:
        traceback.print_exception(e)
        print(f"An error occurred while processing device {hostname} with IP {ip_address}: {e}")

def get_device(address, vendor):
    if vendor == "Cisco":
        return ezsnmp.Session(**{
            "version": 2,
            "hostname": address,
            "remote_port": 161,
            "community": "public",
            "timeout": 5,
            "retries": 3,
            "use_numeric": True,
        })
        # return snmp_util.v2(address, "public", timeout=5)
    if vendor == "Juniper":
        return ezsnmp.Session(**{
            "version": 3,
            "hostname": address,
            "remote_port": 161,
            "security_level": "authPriv",
            "security_username": "identity",
            "timeout": 5,
            "retries": 3,
            "use_numeric": True,
            "auth_protocol": "SHA",
            "auth_password": "supersecret",
            "privacy_protocol": "AES",
            "privacy_password": "supersecret",
        })
        # return snmp_util.v3(address, "identity", "super", snmp_util.v3.AuthMethod.SHA1, "secret", snmp_util.v3.PrivMethod.AES128, timeout=5)

db = get_db()
cursor = db.cursor()

cursor.execute("SELECT hostname, ip_address, vendor FROM host_inventory WHERE vendor IN ('Cisco','Juniper')")
devices = cursor.fetchall()
cursor.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=64) as executor:
    futures = [executor.submit(process_host, hostname, ip_address, vendor) for hostname, ip_address, vendor in devices]
    for future in concurrent.futures.as_completed(futures):
        future.result()

carlkidcrypto added a commit that referenced this issue Mar 10, 2024
- this adds integration tests per bug #44
- this adds integration tests to the tests.yml file as well.
- multi proc still having issues with USM errors.
carlkidcrypto added a commit that referenced this issue Mar 10, 2024
* Create test_multi_threading.py

- added a test per what @mtuska posted in the bug.

- adding a sleep "fixes" the issue. I think we may have a possible resource contention problem.

* Mutex

- adding a mutex for the protection of the `__remove_user_from_cache`

- updated tests. Timeout errors are normal IMHO since we are hammering the snmp server.

* Another Mutex

- Fixed another issue that arose with multi-threading. Was seeing inconsistent failures of PDUs.

* Flake

- fixing flake errors
- letting req text file be the source of truth for the "Install pip dependencies" step.

* test.yml: Pytest coverage comment step

- the "Pytest coverage comment" seems to be broken. Attempting to add the path.

* Update tests.yml

- debug getting info out of the tests.yml.
Step "Create multi-file output listing" may not be working as expected.
- quick attempt to fix MacOS tests. All but one of the test cases fail.

* Update tests.yml

okay, that made the step for macos fail. reverting change.

* MishaKav/pytest-coverage-comment

- hmm I think the latest update to this broke stuff. Let me try reverting versions.

* Oops

- deleted code by accident. Reverted that and still use 1.1.50 of pytest-coverage-comment.

* /pytest-coverage-comment@v1.1.51

- back to /pytest-coverage-comment@v1.1.51
no change. Turns out I think all tests have to pass in order for coverage to work...

* Added More Tests

- fixed the null check, thanks again @mtuska
- added multi proc tests as well. Still not seeing the issues via TDD.

* Adding more jobs

- added more jobs to tests
- reverted a move of the NULL checks; valgrind complaints grew...

* Update test_multi_proc_thread.py

- black .

* Bump to version 1.0.0d

- okay this is running and passing on my MacOS 14.13.1 using python3. Not sure what's up with the github actions image of macos.

* Oops Bad Release Version

- fixing the bad release version. Looks like we can't use letters anymore. I'll have to document my process. Let's stick to
1.0.0.a0 - aX for alpha builds
1.0.0.b0 - bX for beta builds
1.0.0.c0 - cX for release candidate builds.

* Threads

- this gets threads to work. The integration tests for threads all pass.
- this gets multi process to mostly work. They occasionally error out with "USM unknown security name (no such user exists)" or "no such name error encountered"

* Adding Integration Test

- this adds integration tests per bug #44
- this adds integration tests to the tests.yml file as well.
- multi proc still having issues with USM errors.