Epic: Rework remoteConfig() to support command and control capabilities #498

jrcheli · 2021-08-19T18:41:11Z

This is one of the tasks that was noted in #305. Allow reception of commands from an external source again.

ledbit · 2021-11-29T17:47:31Z

We had to globally disable remote config when support for TLS was added.

jrcheli · 2022-03-14T19:19:22Z

As a starting point, I wanted to get remoteConfig working for non-TLS connections again. To do this, I made these changes to a stock v1.0.2:

ubuntu@ip-10-8-107-159:~/jrc/appscope3$ git diff
diff --git a/src/wrap.c b/src/wrap.c
index c5a527d7..edc98b90 100644
--- a/src/wrap.c
+++ b/src/wrap.c
@@ -448,10 +448,6 @@ remoteConfig()
     timeout = 1;
     memset(&fds, 0x0, sizeof(fds));
 
-/*
-    Setting fds.events = 0 to neuter ability to process remote
-    commands... until this is function is reworked to be TLS-friendly.
-
     cfg_transport_t ttype = ctlTransportType(g_ctl, CFG_CTL);
     if ((ttype == (cfg_transport_t)-1) || (ttype == CFG_FILE) ||
         (ttype ==  CFG_SYSLOG) || (ttype == CFG_SHM)) {
@@ -459,9 +455,7 @@ remoteConfig()
     } else {
         fds.events = POLLIN;
     }
-*/
 
-    fds.events = 0;
     fds.fd = ctlConnection(g_ctl, CFG_CTL);
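
To make the effect of the diff above concrete, here is a minimal standalone sketch of the decision it restores. The enum and function name are hypothetical stand-ins (the real code uses cfg_transport_t and ctlTransportType()), and the body of the if-branch falls between the two hunks, so this sketch assumes it sets the events to 0.

```c
#include <poll.h>

/* Hypothetical stand-in for the library's cfg_transport_t values. */
typedef enum {
    SK_UNKNOWN = -1,
    SK_NETWORK,     /* any socket-based transport */
    SK_FILE,
    SK_SYSLOG,
    SK_SHM
} sk_transport_t;

/* Return the poll() events to request on the control channel. */
short
sk_ctl_poll_events(sk_transport_t ttype)
{
    if ((ttype == SK_UNKNOWN) || (ttype == SK_FILE) ||
        (ttype == SK_SYSLOG) || (ttype == SK_SHM)) {
        return 0;       /* assumption: no remote commands on these transports */
    }
    return POLLIN;      /* readable data may be a remote command */
}
```

With the hard-coded fds.events = 0 removed, socket-based transports get POLLIN again, which is what lets the GetCfg request below reach the scoped process.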

Building with these changes allowed me to successfully communicate with a scoped process. For specifics, first I started an app that would stick around a while in one shell:

LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false sleep 10000

Then in another shell, I created a /tmp/cmdin file with what should be a valid command, launched tcpserver, and then typed "U" and enter in the tcpserver shell to tell tcpserver to send the contents of /tmp/cmdin to the sleep process above.

echo '{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }' > /tmp/cmdin
tcpserver 9109
U<ENTER>

What showed that it was working was that the scoped process sent a proper response to that command:

U
tcp:311
tcp:333 fds[2].fd=4 rc 61
{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }

{"type":"resp","body":{"current":{"metric":{"enable":"true","transport":{"type":"udp","host":"127.0.0.1","port":"8125","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"statsd","statsdprefix":"","statsdmaxlen":512,"verbosity":4},"watch":[{"type":"statsd"}]},"libscope":{"log":{"level":"warning","transport":{"type":"file","path":"/tmp/scope.log","buffering":"line"}},"configevent":"true","summaryperiod":10,"commanddir":"/tmp"},"event":{"enable":"true","transport":{"type":"tcp","host":"127.0.0.1","port":"9109","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"ndjson","maxeventpersec":10000,"enhancefs":"true"},"watch":[{"type":"file","name":"(\\/logs?\\/)|(\\.log$)|(\\.log[.\\d])","field":".*","value":".*"},{"type":"console","name":"(stdout)|(stderr)","field":".*","value":".*"},{"type":"http","name":".*","field":".*","value":".*","headers":[]},{"type":"net","name":".*","field":".*","value":".*"},{"type":"fs","name":".*","field":".*","value":".*"},{"type":"dns","name":".*","field":".*","value":".*"}]},"payload":{"enable":"false","dir":"/tmp"},"tags":{},"protocol":[]}},"req":"GetCfg","reqId":987413948756391,"status":200}

jrcheli · 2022-03-14T19:46:03Z

Without any changes from the code above, I can see that TLS is no bueno, which is why this ticket was written in the first place:
In one shell:

LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false SCOPE_EVENT_TLS_ENABLE=true SCOPE_EVENT_TLS_VALIDATE_SERVER=false sleep 10000

In a second shell:

openssl req -nodes -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem
./tcpserver -t 9109

What I saw:

ubuntu@ip-10-8-107-159:~/jrc/appscope3$ ./tcpserver -t 9109
Server set up parent TCP socket.
TCP connection accepted.
TLS connection accepted.
{"format":"ndjson","info":{"process":{"libscopever":"v1.0.2+","pid":368,"ppid":26597,"gid":1000,"groupname":"ubuntu","uid":1000,"username":"ubuntu","hostname":"ip-10-8-107-159","procname":"sleep","cmd":"sleep 1000","id":"ip-10-8-107-159-sleep-sleep 1000"},"configuration":{"current":{"metric":{"enable":"true","transport":{"type":"udp","host":"127.0.0.1","port":"8125","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"statsd","statsdprefix":"","statsdmaxlen":512,"verbosity":4},"watch":[{"type":"statsd"}]},"libscope":{"log":{"level":"warning","transport":{"type":"file","path":"/tmp/scope.log","buffering":"line"}},"configevent":"true","summaryperiod":10,"commanddir":"/tmp"},"event":{"enable":"true","transport":{"type":"tcp","host":"127.0.0.1","port":"9109","tls":{"enable":"true","validateserver":"false","cacertpath":""}},"format":{"type":"ndjson","maxeventpersec":10000,"enhancefs":"true"},"watch":[{"type":"file","name":"(\\/logs?\\/)|(\\.log$)|(\\.log[.\\d])","field":".*","value":".*"},{"type":"console","name":"(stdout)|(stderr)","field":".*","value":".*"},{"type":"http","name":".*","field":".*","value":".*","headers":[]},{"type":"net","name":".*","field":".*","value":".*"},{"type":"fs","name":".*","field":".*","value":".*"},{"type":"dns","name":".*","field":".*","value":".*"}]},"payload":{"enable":"false","dir":"/tmp"},"tags":{},"protocol":[]}},"environment":{}}}
{"type":"evt","id":"ip-10-8-107-159-sleep-sleep 1000","_channel":"779630675201980","body":{"sourcetype":"fs","_time":1647283718.375675,"source":"fs.open","host":"ip-10-8-107-159","proc":"sleep","cmd":"sleep 1000","pid":368,"data":{"proc":"sleep","pid":368,"host":"ip-10-8-107-159","file":"/etc/ssl/certs/ca-certificates.crt","proc_uid":1000,"proc_gid":1000,"proc_cgroup":"0::/user.slice/user-1000.slice/session-277.scope","file_perms":644,"file_owner":0,"file_group":0,"op":"fopen64"}}}
{"type":"evt","id":"ip-10-8-107-159-sleep-sleep 1000","_channel":"779630675201980","body":{"sourcetype":"fs","_time":1647283718.381156,"source":"fs.close","host":"ip-10-8-107-159","proc":"sleep","cmd":"sleep 1000","pid":368,"data":{"proc":"sleep","pid":368,"host":"ip-10-8-107-159","file":"/etc/ssl/certs/ca-certificates.crt","proc_uid":1000,"proc_gid":1000,"proc_cgroup":"0::/user.slice/user-1000.slice/session-277.scope","file_perms":644,"file_owner":0,"file_group":0,"file_read_bytes":834615,"file_read_ops":3273,"file_write_bytes":0,"file_write_ops":0,"duration":0,"op":"fclose"}}}
Server thinks a client closed a TLS session
Server shut down a TLS session.
Server thinks a client shut down the TLS session.
Server closed TCP socket.
Server shut down parent TCP socket.
Server closed parent TCP socket.

jrcheli · 2022-03-23T01:59:13Z

So, I've dug into this a bit, and I think I've figured out how to add command/control over TLS. It's not totally trivial, and I'll try to explain why here. My conclusion as I write this is that I shouldn't do this now, just due to time constraints. Instead I'll try to leave enough breadcrumbs of a description to explain what it would take, should we decide to do this later...

First, what happened above where I said "TLS is no bueno" is:

When we call SSL_connect() in transport.c:establishTlsConnection(), the resulting handshaking/negotiation consists of some number of sends and receives behind the scenes.
The receives cause some incoming TLS data to be picked up later by the poll() in wrap.c:remoteConfig().
We expect the incoming data to be a request/command, but what we're actually receiving in remoteConfig() is TLS handshaking/negotiation data.
Continuing in remoteConfig(), we try to parse the incoming TLS data as if it were a command. It doesn't parse, so we throw up our hands and close the connection as a crude kind of error handling.
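
To illustrate why the parse fails, here is a small sketch that tells the two kinds of input apart by their first bytes. This is not something libscope does; the function names are hypothetical, and the only facts relied on are the standard 5-byte TLS record header (content type 20 to 23, then a 0x03 major version byte) and that the JSON requests in this thread start with '{'.

```c
#include <stddef.h>

/* Return 1 if buf starts with what looks like a TLS record header. */
int
sk_looks_like_tls_record(const unsigned char *buf, size_t len)
{
    if (len < 5) return 0;                      /* record header is 5 bytes */
    if (buf[0] < 20 || buf[0] > 23) return 0;   /* content type */
    if (buf[1] != 0x03) return 0;               /* protocol major version */
    if (buf[2] > 0x04) return 0;                /* protocol minor version */
    return 1;
}

/* Return 1 if buf could be the start of a JSON request object. */
int
sk_looks_like_json_request(const unsigned char *buf, size_t len)
{
    return (len > 0) && (buf[0] == '{');
}
```

Raw bytes read from the socket during or right after the handshake fall into the first category, so handing them to a JSON parser can only fail.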

My first attempt to fix this failed, and it took a bit to figure out why that approach was not viable. What I tried is equivalent to what the original poster in this thread tried... https://openssl-users.openssl.narkive.com/l4JsYKS8/ssl-peek-vs-ssl-pending

After the poll() returns and says there is incoming data, I tried calling SSL_has_pending() to see if there was SSL data, planning to use SCOPE_SSL_read() to handle it if there was. But... SSL_has_pending() never returned true, even when I could see that the data was TLS stuff! After finding the above link and poring over it a few times, I came to understand that for the state of the SSL subsystem to work correctly, I can't interleave poll()s with SSL_ calls in this way. (poll and select are equivalent for the sake of understanding what's going on here.)

This reads to me a little like a zen poem, but taking it to heart and following its suggestion helped me get where I needed to be:

If one thinks they need to use select() on a blocking socket, 
  use non-blocking sockets instead. 
And only when non-blocking sockets are insufficient, 
  use select()

After we establish a socket connection using an async (non-blocking) socket, we've always switched the socket to be blocking, then if desired, did the establishTlsConnection() stuff and all later stuff as a blocking socket. This zen poem made me think that I needed to switch things to be non-blocking to have a chance.
The first thing I tried after reading this was to stop switching the socket to be blocking. I think this could work, but it didn't for me because the SSL_connect() in establishTlsConnection() failed. Presumably I needed to do more work to handle errors due to the socket being non-blocking. Hmmm...

I then realized that for the sake of investigation at least, I could let the establishTlsConnection() run as a blocking socket as we always have, and then after the TLS connection was established, I could switch the socket to be non-blocking. This got me farther. At this point I have a tls connection working, and just need to figure out how to read the command/control data in a way that doesn't hose the stateful nature of our TLS subsystem.
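
For reference, switching an already-connected socket to non-blocking is a small fcntl() change. A minimal sketch follows; the helper name is hypothetical, and in the experiment described above it would be called only after establishTlsConnection() has finished on the blocking socket.

```c
#include <fcntl.h>

/* Switch fd to non-blocking mode. Returns 0 on success, -1 on error. */
int
sk_set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);      /* read the current status flags */
    if (flags == -1) return -1;
    /* keep the existing flags and add O_NONBLOCK */
    return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) ? -1 : 0;
}
```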

To do this in remoteConfig(), before we poll() we can now call SCOPE_SSL_read(), because it won't block. I've seen it return no error, and I've seen it return SSL_ERROR_WANT_READ. If the error is SSL_ERROR_WANT_READ, it's now OK to poll in a way that won't hose the TLS subsystem's state. Voilà! This is how it would be possible to support command/control over TLS. The recipe, if we're using TLS, is to 1) use non-blocking sockets, and 2) be sure to call SCOPE_SSL_read() before calling poll().
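
The control flow of that recipe can be sketched without pulling in OpenSSL by hiding the TLS read behind a callback. Everything named sk_* here is hypothetical. In real code the callback would wrap SCOPE_SSL_read() and map an SSL_get_error() result of SSL_ERROR_WANT_READ to SK_WANT_READ. The scripted reader exists only so the sketch can be exercised on its own.

```c
#include <poll.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for a non-blocking TLS read. */
typedef enum { SK_GOT_DATA, SK_WANT_READ, SK_READ_ERROR } sk_read_status_t;

/* Hypothetical non-blocking read callback (stands in for the TLS read). */
typedef sk_read_status_t (*sk_read_fn)(void *ctx, char *buf, size_t len,
                                       size_t *nread);

/*
 * Try the TLS-level read first, and poll() the raw socket only when the
 * read says it needs more bytes. Returns the number of bytes read,
 * 0 if nothing arrived before the timeout, or -1 on error.
 */
long
sk_read_then_poll(int fd, sk_read_fn rd, void *ctx, char *buf, size_t len,
                  int timeout_ms)
{
    size_t n = 0;
    sk_read_status_t st = rd(ctx, buf, len, &n);

    if (st == SK_WANT_READ) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc <= 0) return rc;         /* -1 on error, 0 on timeout */
        st = rd(ctx, buf, len, &n);     /* socket is readable: retry the read */
    }
    if (st == SK_GOT_DATA) return (long)n;
    return (st == SK_WANT_READ) ? 0 : -1;
}

/* Scripted reader for demonstration only: reports WANT_READ `want_reads`
 * times, then delivers `msg`. */
typedef struct { int want_reads; const char *msg; } sk_demo_ctx_t;

sk_read_status_t
sk_demo_read(void *ctx, char *buf, size_t len, size_t *nread)
{
    sk_demo_ctx_t *d = ctx;
    if (d->want_reads > 0) { d->want_reads--; return SK_WANT_READ; }
    size_t n = strlen(d->msg);
    if (n > len) n = len;
    memcpy(buf, d->msg, n);
    *nread = n;
    return SK_GOT_DATA;
}
```

The point of the ordering is that poll() is only ever used to wait for bytes the TLS layer has asked for, never to decide on its own that application data is ready.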

jrcheli · 2022-03-23T16:24:19Z

To capture the state of my experimental code before I commit anything, I captured diffs against the 3f90a84 version, which I'm attaching here. It's absolutely not ready for primetime, but might be something to refer to if my comments above are too cryptic/hard to follow...
exploring.patch.txt

To rehydrate this, git clone git@github.com:criblio/appscope.git, git checkout 3f90a849d6df43b98701e5f271d1b092df80fcd7, patch -p1 < exploring.patch.txt

jrcheli · 2022-03-24T21:38:59Z

Ok. So the current plan is to support command and control on tcp and unix connections, with the limitation that we won't support tls right now.

With this agreement, I think the commit here on the feature/498-remote-config branch is ready for prime time and can now be merged if needed.

jrcheli · 2022-05-10T20:37:42Z

I originally saw an issue with tcpserver and tls, during the timeframe of the above comment (Mar 24, 2022).
It could be seen when the scoped process was started after tcpserver in tls mode. The good news is that after merging master into the feature/498-remote-config branch, I don't see this behavior anymore. I can start tcpserver first, or start the scoped process first, and have not seen that issue I originally observed. I don't know what the original cause was or what part of the merged code solves it, but I don't see a reason to go deeper now.

jrcheli · 2022-05-11T19:09:12Z

The final set of testing I performed:
0) Generate keys used below

openssl req -nodes -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem

1) Verify that tcpserver without tls can receive data from AppScope

<terminal1> LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false top
<terminal2> ./tcpserver 9109

2) Verify that tcpserver without tls can send commands that generate a response from AppScope

<terminal1> LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false top
<terminal2> echo '{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }' > /tmp/cmdin
<terminal2> ./tcpserver 9109
<terminal2> U <enter>

3) Verify that tcpserver with tls can receive data from AppScope

<terminal1> LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false SCOPE_EVENT_TLS_ENABLE=true SCOPE_EVENT_TLS_VALIDATE_SERVER=false SCOPE_EVENT_TLS_CA_CERT_PATH=./cert.pem top
<terminal2> ./tcpserver -t 9109

4) Verify that tcpserver with tls can send commands, but that it won't interfere with the AppScope'd program (AppScope will not respond)

<terminal1> LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false SCOPE_EVENT_TLS_ENABLE=true SCOPE_EVENT_TLS_VALIDATE_SERVER=false SCOPE_EVENT_TLS_CA_CERT_PATH=./cert.pem top
<terminal2> ./tcpserver -t 9109
<terminal2> U <enter>

All tests passed.

jrcheli · 2022-05-12T20:29:55Z

During the review, @iapaddler asked whether we really want to do a ctlDisconnect() in the "no bueno" situation above... #498 (comment)

After talking, it does seem like receiving garbage on the ctl channel should not cause us to drop the connection. So I made one more change here... removed the ctlDisconnect() from remoteConfig().

To test this:
Before making this change, I stripped the trailing newline character from the /tmp/cmdin file I had been using above (truncate -s -1 /tmp/cmdin) and observed that a scoped process would kill the connection in response to a "U" from tcpserver. I could observe the connection being killed by running the following watch command and seeing the socket descriptor for port 9109 change in response to the "U" command. (The library is disconnecting, then reconnecting on a new socket descriptor.)

watch "lsof -p <pidof top>"

The output of tcpserver makes it clear the connection is getting dropped too:

{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }
server established connection on [2].4 with localhost (127.0.0.1:43556)
{"format":"ndjson","info":{"process":{"libscopever":"v1.1.0-tc0-25-g1bf647542925","pid":3014,"ppid":21968,"gid":1000,"groupname":"ubuntu","uid":1000,"username":"ubuntu","hostname":"ip-10-8-107-159","procname":"top","cmd":"top","id":"ip-10-8-107-159-top-top"},"configuration":{"current":{"metric":{"enable":"true","transport":{"type":"udp","host":"127.0.0.1","port":"8125","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"statsd","statsdprefix":"","statsdmaxlen":512,"verbosity":4},"watch":[{"type":"statsd"}]},"libscope":{"log":{"level":"warning","transport":{"type":"file","path":"/tmp/scope.log","buffering":"line"}},"configevent":"true","summaryperiod":10,"commanddir":"/tmp"},"event":{"enable":"true","transport":{"type":"tcp","host":"127.0.0.1","port":"9109","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"ndjson","maxeventpersec":10000,"enhancefs":"true"},"watch":[{"type":"file","name":"(\\/logs?\\/)|(\\.log$)|(\\.log[.\\d])","field":".*","value":".*"},{"type":"console","name":"(stdout)|(stderr)","field":".*","value":".*"},{"type":"http","name":".*","field":".*","value":".*","headers":[]},{"type":"net","name":".*","field":".*","value":".*"},{"type":"fs","name":".*","field":".*","value":".*"},{"type":"dns","name":".*","field":".*","value":".*"}]},"payload":{"enable":"false","dir":"/tmp"},"tags":{},"protocol":[]}},"environment":{}}}

After making this change, I ran the same test, and observed with a similar watch command that appscope did not kill the connection for port 9109. I could also see the difference from the output of tcpserver:

{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }
{"type":"info","body":"Error in receive from stream.  Scope receive retries exhausted."}

I'm considering this a successful test. ✅
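
The truncate test suggests that a request only counts as complete once its trailing newline arrives. Assuming requests are newline-delimited (an inference from the test, not something confirmed from the library source), the completeness check could look like the sketch below. The function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the first complete newline-terminated message in
 * buf (including the newline), or 0 if no complete message is present yet.
 */
size_t
sk_complete_message_len(const char *buf, size_t len)
{
    const char *nl = memchr(buf, '\n', len);
    return nl ? (size_t)(nl - buf) + 1 : 0;
}
```

With a check like this, an unterminated request is just "not ready yet", which matches the new behavior of reporting an error after the retries run out instead of dropping the connection.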

jrcheli added this to the 0.7.3 milestone Aug 19, 2021

jrcheli self-assigned this Aug 19, 2021

jrcheli mentioned this issue Aug 19, 2021

Continue tls development #3 #305

Closed

10 tasks

ghost modified the milestones: 0.7.6, Backlog Sep 13, 2021

iapaddler added the libscope label Oct 27, 2021

iapaddler modified the milestones: Backlog, 1.0.0 Oct 27, 2021

iapaddler unassigned jrcheli Oct 28, 2021

ledbit added the Agent label Nov 29, 2021

iapaddler added the Aspirational label Nov 30, 2021

ghost modified the milestones: Next Major (1.0.0), Backlog Feb 3, 2022

iapaddler modified the milestones: Backlog, Next Minor (1.1.0) Feb 17, 2022

iapaddler removed Aspirational Agent labels Feb 17, 2022

seanvaleo pinned this issue Mar 11, 2022

jrcheli self-assigned this Mar 14, 2022

iapaddler changed the title ~~Rework remoteConfig() to use tls for command/response with logstream~~ Rework remoteConfig() to support command and control capabilities Mar 14, 2022

iapaddler added the Command and Control label Mar 14, 2022

This was referenced Mar 14, 2022

Establish a remote command and control source for dev and test #854

Closed

Support for a get config command #855

Closed

Support for a set config command #856

Closed

jrcheli added a commit that referenced this issue Mar 24, 2022

(#498) Enable external commands/requests in remoteConfig().

b50583b

iapaddler changed the title ~~Rework remoteConfig() to support command and control capabilities~~ Epic: Rework remoteConfig() to support command and control capabilities Apr 6, 2022

iapaddler unpinned this issue May 2, 2022

This was referenced May 11, 2022

Remote config #948

Merged

Candidates for additional testing #951

Closed

jrcheli added a commit that referenced this issue May 12, 2022

(#498) Updated per review comments.

1bf6475

jrcheli added a commit that referenced this issue May 12, 2022

(#498) Don't disconnect if there are issues w/the incoming message

a36cb3f

iapaddler closed this as completed in #948 May 13, 2022

abetones mentioned this issue Jun 10, 2022

[Docs]: 1.1.0 Changelog #877

Closed
