[BUG] v3 multi-threading fails due to user cache #44

Closed
mtuska opened this issue Mar 1, 2024 · 10 comments · Fixed by #45
@mtuska

mtuska commented Mar 1, 2024

EzSNMP release version OR commit number
1.0.0c

Operating System and Version

  • OS: Ubuntu
  • Version 22

Net-SNMP Library Version

  • 5.9.1

Describe the bug
When using multiple threads with SNMPv3, the user cache is cleared when one of the sessions ends, causing the other threads to fail.

To Reproduce
The pytest I've built to replicate this issue:

test_threading.py

from __future__ import unicode_literals

import platform
import re

import pytest
from ezsnmp.exceptions import (
    EzSNMPError,
    EzSNMPConnectionError,
    EzSNMPTimeoutError,
    EzSNMPNoSuchObjectError,
    EzSNMPNoSuchInstanceError,
    EzSNMPNoSuchNameError,
)

from ezsnmp.session import Session
import concurrent.futures

@pytest.mark.parametrize("workers", [1, 8, 16])
@pytest.mark.parametrize("jobs", [16])
def test_session_threaded(sess_args, workers, jobs):
    def do_work(sess_args):
        sess = Session(**sess_args)
        res = sess.get([("sysUpTime", "0"), ("sysContact", "0"), ("sysLocation", "0")])

        assert len(res) == 3

        assert res[0].oid == "sysUpTimeInstance"
        assert res[0].oid_index == ""
        assert int(res[0].value) > 0
        assert res[0].snmp_type == "TICKS"

        assert res[1].oid == "sysContact"
        assert res[1].oid_index == "0"
        assert res[1].value == "G. S. Marzot <gmarzot@marzot.net>"
        assert res[1].snmp_type == "OCTETSTR"

        assert res[2].oid == "sysLocation"
        assert res[2].oid_index == "0"
        assert res[2].value == "my original location"
        assert res[2].snmp_type == "OCTETSTR"
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(do_work, sess_args) for _ in range(jobs)]
        for future in futures:
            future.result()

With 1 worker it passes without issue, but with 8 or 16 workers it fails.
To remediate the issue, I commented out the call to "__remove_user_from_cache" on line 1611, which lets the test pass. Probably not the ideal fix, but it seems to confirm the issue is related to the user caching.
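
As a caller-side workaround sketch only (not a fix for the library itself; it assumes the same sess_args as above, and the helper names here are hypothetical), keeping one Session alive per worker thread avoids session teardown, and therefore the cache clear, while other threads still have requests in flight:

import threading

from ezsnmp.session import Session

_tls = threading.local()

def get_thread_session(sess_args):
    # Hypothetical helper: one long-lived Session per worker thread, so
    # session teardown (which clears the USM user cache) never runs while
    # other threads still have requests in flight.
    if getattr(_tls, "session", None) is None:
        _tls.session = Session(**sess_args)
    return _tls.session

def do_work(sess_args):
    sess = get_thread_session(sess_args)
    return sess.get([("sysUpTime", "0")])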

Expected behavior
SNMPv3 sessions work in multiple threads.

@mtuska mtuska added the bug Something isn't working label Mar 1, 2024
@carlkidcrypto
Owner

@mtuska Thanks for reporting this. I'll try to take a look at this and resolve it. Feel free to open up a PR if you find a solution before I do.

@carlkidcrypto carlkidcrypto self-assigned this Mar 2, 2024
@carlkidcrypto
Owner

carlkidcrypto commented Mar 3, 2024

@mtuska If you are comfortable installing code from source, give this branch a try: 44-bug-v3-multi-threading-fails-due-to-user-cache. If it still doesn't work for you, can you paste some of the Python stack trace errors?

@carlkidcrypto carlkidcrypto linked a pull request Mar 3, 2024 that will close this issue
@mtuska
Author

mtuska commented Mar 3, 2024

Wow, you were quick on this; I'll give it a run-through on Monday morning. Additionally, I've seen a strange issue with SNMPv3 usmStatsNotInTimeWindows when calling different hosts, but I'll open a new issue for that. I still need to find/build a reproduction of that issue.

@mtuska
Author

mtuska commented Mar 4, 2024

After some real-world tests (maybe we should add artificial sleep time to the threading test; see the sketch at the end of this comment), the branch appears to be single-threaded, as the responses come back one at a time from what I'm witnessing. This run queries 414 devices (some v2, some v3; the v3 credentials are shared) with 64 threads.

Time with removal of clearing cache in the main branch(main):
real 1m4.800s
user 0m36.031s
sys 0m4.493s

Time with suggested branch(44-bug-v3-multi-threading-fails-due-to-user-cache) until segmentation fault:
real 3m21.835s
user 0m0.305s
sys 0m0.036s

Thread 65 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff15ffb640 (LWP 1232404)]
__strcmp_evex () at ../sysdeps/x86_64/multiarch/strcmp-evex.S:139
139 ../sysdeps/x86_64/multiarch/strcmp-evex.S: No such file or directory.
(gdb) bt
#0 __strcmp_evex () at ../sysdeps/x86_64/multiarch/strcmp-evex.S:139
#1 0x00007ffff60ce5e4 in __remove_user_from_cache (ss=) at ezsnmp/interface.cpp:1389
#2 0x00007ffff60ce67b in delete_session_capsule (session_capsule=) at ezsnmp/interface.cpp:1621
#3 0x00005555557ba3dd in capsule_dealloc (o=<PyCapsule at remote 0x7ffff4773ab0>) at ../Objects/capsule.c:261
#4 0x000055555568a8b3 in _Py_Dealloc (op=) at ../Objects/object.c:2301
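
A minimal sketch of the artificial-sleep idea mentioned above (assuming the same sess_args fixture as the original test; the sleep range is arbitrary):

import concurrent.futures
import random
import time

import pytest

from ezsnmp.session import Session

@pytest.mark.parametrize("workers", [8, 16])
@pytest.mark.parametrize("jobs", [16])
def test_session_threaded_with_sleep(sess_args, workers, jobs):
    def do_work(sess_args):
        sess = Session(**sess_args)
        res = sess.get([("sysUpTime", "0")])
        # Hold the session open for a moment so teardown in one thread
        # overlaps with in-flight requests in the others.
        time.sleep(random.uniform(0.05, 0.25))
        assert len(res) == 1

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(do_work, sess_args) for _ in range(jobs)]
        for future in futures:
            future.result()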

@mtuska
Author

mtuska commented Mar 4, 2024

Oh, after a bit of investigating I recall I made another fix for this segmentation fault. The issue comes from running both v1/v2 and v3 in the same process, as both will call over to __remove_user_from_cache, where the strcmp calls can then hit NULL securityName/contextEngineID values (I think the test didn't display this issue, since it goes v1, v2c, then v3, whereas the real world is more random), so this would require additional checking. In my local edits of easysnmp I had added checks for valid actUser fields:

static void __remove_user_from_cache(struct session_list *ss)
{
    struct usmUser *actUser = usm_get_userList();
    struct usmUser *nextUser = NULL;
    while (actUser != NULL)
    {
        nextUser = actUser->next;
        if (
+            actUser->secName != NULL && ss->session->securityName != NULL &&
+            actUser->engineID != NULL && ss->session->contextEngineID != NULL &&
            strcmp((const char *)actUser->secName, (const char *)ss->session->securityName) == 0 &&
            strcmp((const char *)actUser->engineID, (const char *)ss->session->contextEngineID) == 0)
        {
            usm_remove_user(actUser);
            usm_free_user(actUser);
        }
        actUser = nextUser;
    }
}

I'm currently rerunning it to double-check that the fault is removed by this. It will take a while with my devices due to the single-threaded nature of the branch.

Edit:
During the re-run, I am seeing many "ezsnmp.exceptions.EzSNMPError: USM unknown security name (no such user exists)" errors for devices that were previously working without issue.

Final time for branch with the above edit to avoid the segmentation fault:
real 41m48.719s
user 0m25.336s
sys 0m2.665s

@carlkidcrypto
Owner

carlkidcrypto commented Mar 4, 2024

> Oh, after a bit of investigating I recall I made another fix for this segmentation fault. The issue comes from running both v1/v2 and v3 in the same process, as both will call over to __remove_user_from_cache (I think the test didn't display this issue, since it goes v1, v2c, then v3, whereas the real world is more random) [...] In my local edits of easysnmp I had added checks for valid actUser fields [...]

Ahh, you're right, it does go in order: v1, v2c, etc. Let me add a test case that's more "real world".

And I'll pull in your version of __remove_user_from_cache.
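
For illustration, a mixed-version "real world" threaded test along those lines might look roughly like this (a hedged sketch, not the exact test added to the repo; sess_v2_args and sess_v3_args are assumed fixtures providing v2c and v3 session arguments):

import concurrent.futures
import random

import pytest

from ezsnmp.session import Session

@pytest.mark.parametrize("workers", [8, 16])
@pytest.mark.parametrize("jobs", [32])
def test_session_threaded_mixed_versions(sess_v2_args, sess_v3_args, workers, jobs):
    """Interleave v2c and v3 sessions so session teardown for one version
    overlaps with in-flight requests of the other, as in real-world polling."""

    def do_work(sess_args):
        sess = Session(**sess_args)
        res = sess.get([("sysUpTime", "0"), ("sysLocation", "0")])
        assert len(res) == 2

    # Shuffle so v2c and v3 jobs are not grouped together.
    all_args = [sess_v2_args, sess_v3_args] * (jobs // 2)
    random.shuffle(all_args)

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(do_work, args) for args in all_args]
        for future in futures:
            future.result()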

Edit:

Tests have been added to try and mimic "real world" scenarios. The Linux tests all pass, but it looks like the MacOS tests are showing the "No Such User" exists error - https://github.com/carlkidcrypto/ezsnmp/actions/runs/8151441707/job/22279325545

Based on previous comments, it is suspected to be an SNMPD bug... hmm.

@carlkidcrypto
Owner

carlkidcrypto commented Mar 5, 2024

@mtuska, are you seeing the "No Such User" exists errors on certain devices using a certain SNMP version like v1, v2, etc.? Would you happen to know what SNMPD version the end devices are using?
Also, try once more with what's on the bugfix branch 👍🏽

@mtuska
Author

mtuska commented Mar 5, 2024

I'm still seeing it act single-threaded, and it appears both v2 and v3 are failing for the most part. For my sanity's sake I commented out the std::lock_guard and __remove_user_from_cache calls, and afterwards it's no longer single-threaded and calls are being made successfully.
The "USM unknown security name (no such user exists)" is failing for v3 clients.

@carlkidcrypto
Owner

> I'm still seeing it act single-threaded, and it appears both v2 and v3 are failing for the most part. For my sanity's sake I commented out the std::lock_guard and __remove_user_from_cache calls, and afterwards it's no longer single-threaded and calls are being made successfully.
> The "USM unknown security name (no such user exists)" is failing for v3 clients.

Thanks for the update. Are you able to attach the Python script you are using? I want to understand both the issue and use case. It'll take some time.

@mtuska
Author

mtuska commented Mar 6, 2024

I've been using my own wrapper to abstract the actual SNMP classes and build a MIB helper so I can easily transition between libraries, but I made a shorter version without it for this issue.
My use case for this current work is network devices, specifically Cisco and Juniper devices.

import threading
import mysql.connector
import ezsnmp
import concurrent.futures
import traceback
import re
import ezsnmp.exceptions
from typing import TYPE_CHECKING, Any, Dict, NoReturn, Optional, Tuple, Type, Union
from mysql.connector.pooling import PooledMySQLConnection
from mysql.connector.abstracts import MySQLConnectionAbstract

db_config = {
    'user': '',
    'password': '',
    'host': 'localhost',
    'database': '',
    'raise_on_warnings': True
}

_tls = threading.local()
def get_db() -> Union[PooledMySQLConnection, MySQLConnectionAbstract]:
    if getattr(_tls, 'db_conn', None) is None:
        _tls.db_conn = mysql.connector.connect(**db_config)
    return _tls.db_conn

def process_host(hostname, ip_address, vendor):
    try:
        print(hostname, ip_address, vendor)
        device = get_device(ip_address, vendor)
        device.get([".1.3.6.1.2.1.1.1", ".1.3.6.1.2.1.1.2"])
        # desc = device.get(snmp_util.mibs.snmpv2Mib.sysDescr).get()
        # objId = device.get(snmp_util.mibs.snmpv2Mib.sysObjectID).get()
    except ezsnmp.exceptions.EzSNMPTimeoutError as e:
        traceback.print_exception(e)
        print(f"Timed out while connecting to {hostname} with IP {ip_address}: {e}")
    except Exception as e:
        traceback.print_exception(e)
        print(f"An error occurred while processing device {hostname} with IP {ip_address}: {e}")

def get_device(address, vendor):
    if vendor == "Cisco":
        return ezsnmp.Session(**{
            "version": 2,
            "hostname": address,
            "remote_port": 161,
            "community": "public",
            "timeout": 5,
            "retries": 3,
            "use_numeric": True,
        })
        # return snmp_util.v2(address, "public", timeout=5)
    if vendor == "Juniper":
        return ezsnmp.Session(**{
            "version": 3,
            "hostname": address,
            "remote_port": 161,
            "security_level": "authPriv",
            "security_username": "identity",
            "timeout": 5,
            "retries": 3,
            "use_numeric": True,
            "auth_protocol": "SHA",
            "auth_password": "supersecret",
            "privacy_protocol": "AES",
            "privacy_password": "supersecret",
        })
        # return snmp_util.v3(address, "identity", "super", snmp_util.v3.AuthMethod.SHA1, "secret", snmp_util.v3.PrivMethod.AES128, timeout=5)

db = get_db()
cursor = db.cursor()

cursor.execute("SELECT hostname, ip_address, vendor FROM host_inventory WHERE vendor IN ('Cisco','Juniper')")
devices = cursor.fetchall()
cursor.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=64) as executor:
    futures = [executor.submit(process_host, hostname, ip_address, vendor) for hostname, ip_address, vendor in devices]
    for future in concurrent.futures.as_completed(futures):
        future.result()

carlkidcrypto added a commit that referenced this issue Mar 10, 2024
- this adds integration tests per bug #44
- this adds integration tests to the tests.yml file as well.
- multi proc still having issues with USM errors.
carlkidcrypto added a commit that referenced this issue Mar 10, 2024
* Create test_multi_threading.py

- added a test per what @mtuska posted in the bug.

- adding a sleep "fixes" the issue. I think we may have a possible resource contention problem.

* Mutex

- adding a mutex for the protection of the `__remove_user_from_cache`

- updated tests. Timeout errors are normal IMHO since we are hammering the snmp server.

* Another Mutex

- Fixed another issue that arose with multi-threading. Was seeing inconsistent failures of PDUs.

* Flake

- fixing flake errors
- letting req text file be the source of truth for the "Install pip dependencies" step.

* test.yml: Pytest coverage comment step

- the "Pytest coverage comment" seems to be broken. Attempting to add the path.

* Update tests.yml

- debug getting info out of the tests.yml.
Step "Create multi-file output listing" may not be working as expected.
- quick attempt to fix MacOS tests. All but one of the test cases fail.

* Update tests.yml

okay, that made the step for macos fail. reverting change.

* MishaKav/pytest-coverage-comment

- hmm I think the latest update to this broke stuff. Let me try reverting versions.

* Oops

- deleted code by accident. Reverted that and still use 1.1.50 of pytest-coverage-comment.

* /pytest-coverage-comment@v1.1.51

- back to /pytest-coverage-comment@v1.1.51
no change. Turns out I think all tests have to pass in order for coverage to work...

* Added More Tests

- fixed the null check, thanks again @mtuska
- added multi proc tests as well. Still not seeing the issues via TDD.

* Adding more jobs

- added more jobs to tests
- reverted a move of the NULL checks; valgrind complaints grew...

* Update test_multi_proc_thread.py

- black .

* Bump to version 1.0.0d

- okay this is running and passing on my MacOS 14.13.1 using python3. Not sure what's up with the github actions image of macos.

* Oops Bad Release Version

- fixing the bad release version. Looks like we can't use letters anymore. I'll have to document my process. Let's stick to
1.0.0.a0 - aX for alpha builds
1.0.0.b0 - bX for beta builds
1.0.0.c0 - cX for release candidate builds.

* Threads

- this gets threads to work. The integration tests for threads all pass.
- this gets multi process to mostly work. They occasionally error out with "USM unknown security name (no such user exists)" or "no such name error encountered"

* Adding Integration Test

- this adds integration tests per bug #44
- this adds integration tests to the tests.yml file as well.
- multi proc still having issues with USM errors.