Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spurious ZMQ disconnects in case of oversubscribe CPU #636

Closed
hhaim opened this issue Feb 17, 2021 · 14 comments
Closed

Spurious ZMQ disconnects in case of oversubscribe CPU #636

hhaim opened this issue Feb 17, 2021 · 14 comments
Assignees

Comments

@hhaim
Copy link
Contributor

hhaim commented Feb 17, 2021

The issue: Python client reports as disconnected while the server is up. re-connect is possible.

The reason: ZMQ SUB/PUB report events to the clients, some of the events are critical for the state of the client so it must be reliable while SUB/PUB can discard messages

How to reproduce: run iperf3 or stress-ng on the machine and attach it to master (taskset) core and connect with 1000 threads from a remote

The solution: in case of interactive (STL/ASTF) save the events into a queue and add an RPC to pool the the queue events. STF will keep the same interface. The MAJOR version was incremented so only latest clients could work with the new server. (GUI need to be updated)

@hhaim
Copy link
Contributor Author

hhaim commented Feb 19, 2021

@jsmoon please review

@jsmoon
Copy link
Contributor

jsmoon commented Feb 23, 2021

@hhaim looks good. I think this is related to #336 and I'll close it now also.

For your information, I've been used to check the event messages from the additional trex-console client.
Now I can't do that. But it can be negligible.

And I have a minor suggestion regarding this change.
There are so many messages when verbose is on in the trex-console. Why don't you suppress the messages?

@hhaim
Copy link
Contributor Author

hhaim commented Feb 24, 2021

@jsmoon I didn't realize that my patch solve #366. now all the events are in any way in sync beacuse the async is reading from a queue which is reliable. async is just an interrupt to read the queue (per session).
Your point about another trex-console (readonly) is good. it is a minor fix to make it work as any REQ/RES will get a new session/new queue, it is just not enabled.

In regard to the verbose mode, what happen before this patch? I don't expect so many events with small number of profiles. could you elaborate your suggestion?

@jsmoon
Copy link
Contributor

jsmoon commented Feb 24, 2021

@hhaim I got the following message when I set verbose on in the trex-console.

[verbose] Sending Request To Server:

{
    "id": "0fr9t8gl",
    "jsonrpc": "2.0",
    "method": "get_async_events",
    "params": {
        "api_h": "zIyjdcSp",
        "session_id": 2884486712,
        "user": "sub"
    }
}


trex>



[verbose] Server Response:

{
    "id": "0fr9t8gl",
    "jsonrpc": "2.0",
    "result": {
        "data": []
    }
}

The rate is about 2 times per second. Is there any periodic event on the SUB/PUB channel?

@hhaim
Copy link
Contributor Author

hhaim commented Feb 24, 2021

@jsmoon understood now, in case of SUB/PUB timeout (500msec) there is a read from the queue too. maybe in case of debug we can disable the call in case of timeout

@hhaim
Copy link
Contributor Author

hhaim commented Feb 24, 2021

Could you check if this helps

index d2975bd8..534a4ce5 100644                                                                                     
--- a/scripts/automation/trex_control_plane/interactive/trex/common/trex_subscriber.py                              
+++ b/scripts/automation/trex_control_plane/interactive/trex/common/trex_subscriber.py                              
@@ -296,10 +296,11 @@ class TRexSubscriber():
     def _wait_for_trigger (self):
             try:
                self.socket.recv() # block on event
+               return True 
             except zmq.Again:
-                pass
+                return False
             except zmq.ContextTerminated:
-                pass
+                return False
 
     # thread function
     def _run (self):
@@ -310,7 +311,9 @@ class TRexSubscriber():
         self.seq =0
 
         while self.t_state != self.THREAD_STATE_DEAD:
-            self._wait_for_trigger()
+            res = self._wait_for_trigger()
+            if (res == False) and (self.ctx.get_verbose() == 'debug'):
+                continue  
             if self.rpc.is_connected():
                 rc = self.rpc.transmit("get_async_events", params = {'session_id' :self.session_id , "user":"sub"} )
 

@jsmoon
Copy link
Contributor

jsmoon commented Feb 24, 2021

@hhaim it doesn't work. Each time, the following event message is received periodically.

b'{"name":"tx-gen","type":0,"data":{"realtime-hist":{"min_usec":10,"max_usec":0,"high_cnt":0,"cnt":0,"s_avg":0.000000,"s_max":0.000000, "histogram": [  ] } ,"unknown":0}}'

I found that CGlobalTRex::handle_slow_path() publishes async data periodically every 500 msec.

@hhaim
Copy link
Contributor Author

hhaim commented Feb 24, 2021

events should have value["name"] = "trex-event"; is it a response for request?

    Json::Value value;
    
    value["name"] = "trex-event";
    value["type"] = type;
    value["data"] = data;

@jsmoon
Copy link
Contributor

jsmoon commented Feb 25, 2021

@hhaim no, it's one of the periodic messages generated by the following routine.

COLD_FUNC void
CGlobalTRex::publish_async_data(bool sync_now, bool baseline) {
    std::string json;
  
    if (sync_now) {
        sync_threads_stats();
    }

    /* common stats */
    m_stats.dump_json(json, baseline);
    m_zmq_publisher.publish_json(json);

    /* generator json , all cores are the same just sample the first one */
    m_fl.m_threads_info[0]->m_node_gen.dump_json(json);                           // <-- "tx-gen" generated here
    m_zmq_publisher.publish_json(json);
    
    /* config specific stats */
    m_stx->publish_async_data();
}

@hhaim
Copy link
Contributor Author

hhaim commented Feb 25, 2021

@jsmoon, I missed removing this, it is not needed in interactive mode.

@jsmoon
Copy link
Contributor

jsmoon commented Feb 25, 2021

@hhaim, ok.

@hhaim
Copy link
Contributor Author

hhaim commented Mar 3, 2021

@jsmoon I've added a fix to the async periodic messages in case of verbose.
I've open a issue to support async events in case of read-only console too
#646
(There is a need to add/change the acquire/release, see more in the issue ) . I think it is important to fix this.
Thanks for raising this

@jsmoon
Copy link
Contributor

jsmoon commented Mar 3, 2021

@hhaim, ok. I'll keep on watching it.

@hhaim
Copy link
Contributor Author

hhaim commented Mar 3, 2021

@jsmoon I see the other threads on backward compatibility
We will keep the new commands for read-only mode backward compatible

hhaim added a commit that referenced this issue Mar 22, 2023
Signed-off-by: Hanoh Haim <hhaim@cisco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants