Epic: Rework remoteConfig() to support command and control capabilities #498

Closed
jrcheli opened this issue Aug 19, 2021 · 9 comments · Fixed by #948

Comments

@jrcheli
Contributor

jrcheli commented Aug 19, 2021

This is one of the tasks that was noted in #305. Allow reception of commands from an external source again.

@jrcheli jrcheli added this to the 0.7.3 milestone Aug 19, 2021
@jrcheli jrcheli self-assigned this Aug 19, 2021
@jrcheli jrcheli mentioned this issue Aug 19, 2021
10 tasks
@ghost ghost modified the milestones: 0.7.6, Backlog Sep 13, 2021
@iapaddler iapaddler modified the milestones: Backlog, 1.0.0 Oct 27, 2021
@ledbit
Contributor

ledbit commented Nov 29, 2021

We had to globally disable remote config when support for TLS was added.

@ledbit ledbit added the Agent label Nov 29, 2021
@ghost ghost modified the milestones: Next Major (1.0.0), Backlog Feb 3, 2022
@iapaddler iapaddler modified the milestones: Backlog, Next Minor (1.1.0) Feb 17, 2022
@seanvaleo seanvaleo pinned this issue Mar 11, 2022
@jrcheli jrcheli self-assigned this Mar 14, 2022
@iapaddler iapaddler changed the title Rework remoteConfig() to use tls for command/response with logstream Rework remoteConfig() to support command and control capabilities Mar 14, 2022
@jrcheli
Contributor Author

jrcheli commented Mar 14, 2022

As a starting point, I wanted to get remoteConfig working for non-tls connections again. To do this, I made these changes to a stock v1.0.2:

ubuntu@ip-10-8-107-159:~/jrc/appscope3$ git diff
diff --git a/src/wrap.c b/src/wrap.c
index c5a527d7..edc98b90 100644
--- a/src/wrap.c
+++ b/src/wrap.c
@@ -448,10 +448,6 @@ remoteConfig()
     timeout = 1;
     memset(&fds, 0x0, sizeof(fds));
 
-/*
-    Setting fds.events = 0 to neuter ability to process remote
-    commands... until this is function is reworked to be TLS-friendly.
-
     cfg_transport_t ttype = ctlTransportType(g_ctl, CFG_CTL);
     if ((ttype == (cfg_transport_t)-1) || (ttype == CFG_FILE) ||
         (ttype ==  CFG_SYSLOG) || (ttype == CFG_SHM)) {
@@ -459,9 +455,7 @@ remoteConfig()
     } else {
         fds.events = POLLIN;
     }
-*/
 
-    fds.events = 0;
     fds.fd = ctlConnection(g_ctl, CFG_CTL);

Building with these changes allowed me to successfully communicate with a scoped process. For specifics, first I started an app that would stick around a while in one shell:

LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false sleep 10000

Then in another shell, I created a /tmp/cmdin file with what should be a valid command, launched tcpserver, and then typed "U" and enter in the tcpserver shell to tell tcpserver to send the contents of /tmp/cmdin to the sleep process above.

echo '{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }' > /tmp/cmdin
tcpserver 9109
U<ENTER>

What showed that it was working was that the scoped process sent a proper response to that command:

U
tcp:311
tcp:333 fds[2].fd=4 rc 61
{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }

{"type":"resp","body":{"current":{"metric":{"enable":"true","transport":{"type":"udp","host":"127.0.0.1","port":"8125","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"statsd","statsdprefix":"","statsdmaxlen":512,"verbosity":4},"watch":[{"type":"statsd"}]},"libscope":{"log":{"level":"warning","transport":{"type":"file","path":"/tmp/scope.log","buffering":"line"}},"configevent":"true","summaryperiod":10,"commanddir":"/tmp"},"event":{"enable":"true","transport":{"type":"tcp","host":"127.0.0.1","port":"9109","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"ndjson","maxeventpersec":10000,"enhancefs":"true"},"watch":[{"type":"file","name":"(\\/logs?\\/)|(\\.log$)|(\\.log[.\\d])","field":".*","value":".*"},{"type":"console","name":"(stdout)|(stderr)","field":".*","value":".*"},{"type":"http","name":".*","field":".*","value":".*","headers":[]},{"type":"net","name":".*","field":".*","value":".*"},{"type":"fs","name":".*","field":".*","value":".*"},{"type":"dns","name":".*","field":".*","value":".*"}]},"payload":{"enable":"false","dir":"/tmp"},"tags":{},"protocol":[]}},"req":"GetCfg","reqId":987413948756391,"status":200}
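
For anyone scripting this exchange instead of using echo and tcpserver, here is a rough sketch of building the same request in C. The message layout is copied from the command above; `build_getcfg_request` is a hypothetical helper name, and the trailing newline is included only because echo appends one to /tmp/cmdin.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write a newline-terminated GetCfg request into buf.
 * Returns the message length, or -1 if buf is too small. */
int
build_getcfg_request(char *buf, size_t len, unsigned long long req_id)
{
    int n = snprintf(buf, len,
                     "{ \"type\": \"req\", \"req\": \"GetCfg\", \"reqId\": %llu }\n",
                     req_id);

    /* snprintf returns the length it wanted to write; treat truncation as failure */
    if (n < 0 || (size_t)n >= len) return -1;
    return n;
}
```

The reqId in the response matches the one sent, so a caller can use it to pair responses with requests.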

@jrcheli
Contributor Author

jrcheli commented Mar 14, 2022

Without any changes from the code above, I can see that TLS is no bueno, which is why this ticket was written in the first place:
In one shell:

LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false SCOPE_EVENT_TLS_ENABLE=true SCOPE_EVENT_TLS_VALIDATE_SERVER=false sleep 10000

In a second shell:

openssl req -nodes -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem
./tcpserver -t 9109

What I saw:

ubuntu@ip-10-8-107-159:~/jrc/appscope3$ ./tcpserver -t 9109
Server set up parent TCP socket.
TCP connection accepted.
TLS connection accepted.
{"format":"ndjson","info":{"process":{"libscopever":"v1.0.2+","pid":368,"ppid":26597,"gid":1000,"groupname":"ubuntu","uid":1000,"username":"ubuntu","hostname":"ip-10-8-107-159","procname":"sleep","cmd":"sleep 1000","id":"ip-10-8-107-159-sleep-sleep 1000"},"configuration":{"current":{"metric":{"enable":"true","transport":{"type":"udp","host":"127.0.0.1","port":"8125","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"statsd","statsdprefix":"","statsdmaxlen":512,"verbosity":4},"watch":[{"type":"statsd"}]},"libscope":{"log":{"level":"warning","transport":{"type":"file","path":"/tmp/scope.log","buffering":"line"}},"configevent":"true","summaryperiod":10,"commanddir":"/tmp"},"event":{"enable":"true","transport":{"type":"tcp","host":"127.0.0.1","port":"9109","tls":{"enable":"true","validateserver":"false","cacertpath":""}},"format":{"type":"ndjson","maxeventpersec":10000,"enhancefs":"true"},"watch":[{"type":"file","name":"(\\/logs?\\/)|(\\.log$)|(\\.log[.\\d])","field":".*","value":".*"},{"type":"console","name":"(stdout)|(stderr)","field":".*","value":".*"},{"type":"http","name":".*","field":".*","value":".*","headers":[]},{"type":"net","name":".*","field":".*","value":".*"},{"type":"fs","name":".*","field":".*","value":".*"},{"type":"dns","name":".*","field":".*","value":".*"}]},"payload":{"enable":"false","dir":"/tmp"},"tags":{},"protocol":[]}},"environment":{}}}
{"type":"evt","id":"ip-10-8-107-159-sleep-sleep 1000","_channel":"779630675201980","body":{"sourcetype":"fs","_time":1647283718.375675,"source":"fs.open","host":"ip-10-8-107-159","proc":"sleep","cmd":"sleep 1000","pid":368,"data":{"proc":"sleep","pid":368,"host":"ip-10-8-107-159","file":"/etc/ssl/certs/ca-certificates.crt","proc_uid":1000,"proc_gid":1000,"proc_cgroup":"0::/user.slice/user-1000.slice/session-277.scope","file_perms":644,"file_owner":0,"file_group":0,"op":"fopen64"}}}
{"type":"evt","id":"ip-10-8-107-159-sleep-sleep 1000","_channel":"779630675201980","body":{"sourcetype":"fs","_time":1647283718.381156,"source":"fs.close","host":"ip-10-8-107-159","proc":"sleep","cmd":"sleep 1000","pid":368,"data":{"proc":"sleep","pid":368,"host":"ip-10-8-107-159","file":"/etc/ssl/certs/ca-certificates.crt","proc_uid":1000,"proc_gid":1000,"proc_cgroup":"0::/user.slice/user-1000.slice/session-277.scope","file_perms":644,"file_owner":0,"file_group":0,"file_read_bytes":834615,"file_read_ops":3273,"file_write_bytes":0,"file_write_ops":0,"duration":0,"op":"fclose"}}}
Server thinks a client closed a TLS session
Server shut down a TLS session.
Server thinks a client shut down the TLS session.
Server closed TCP socket.
Server shut down parent TCP socket.
Server closed parent TCP socket.

@jrcheli
Contributor Author

jrcheli commented Mar 23, 2022

So, I've dug in to this a bit, and I think I've figured out how to add command/control over tls. It's not totally trivial. I'll try to explain why here. My conclusion as I write this is that I shouldn't do this now just due to time constraints at the moment. Instead I'll try to leave enough breadcrumbs of a description to explain what it would take should we decide to do this later...

First, what happened above where I said "TLS is no bueno" is:

  1. When we call SSL_connect() in transport.c:establishTlsConnection(), the resulting handshaking/negotiation consists of some number of sends and receives behind the scenes.
  2. The receives cause some incoming tls data to be picked up later by the poll() in wrap.c:remoteConfig().
  3. We expect the incoming data to be a request/command, but what we're actually receiving in remoteConfig() is tls handshaking/negotiation data.
  4. Continuing in remoteConfig(), we try to parse the incoming tls data as if it were a command. It doesn't parse, so we throw up our hands and close the connection as a crude kind of error handling.
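
To make steps 3 and 4 concrete: the first bytes of a TLS record look nothing like the start of a JSON command. A TLS record begins with a content-type byte (20 through 23 for the common types, 22 being handshake) followed by a protocol major version of 3, while a command begins with `{`. This is only an illustrative sketch; both function names are hypothetical, and nothing here claims AppScope performs this check.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if buf starts like a TLS record header,
 * i.e. content type 20..23 (change_cipher_spec, alert, handshake,
 * application_data) followed by protocol major version 3. */
int
looks_like_tls_record(const unsigned char *buf, size_t len)
{
    if (len < 2) return 0;
    return (buf[0] >= 20 && buf[0] <= 23 && buf[1] == 0x03) ? 1 : 0;
}

/* Hypothetical helper: return 1 if the first non-whitespace byte is '{',
 * which is how every command shown in this thread begins. */
int
looks_like_json_command(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len &&
           (buf[i] == ' ' || buf[i] == '\t' || buf[i] == '\r' || buf[i] == '\n')) {
        i++;
    }
    return (i < len && buf[i] == '{') ? 1 : 0;
}
```

Raw bytes read from the socket during a TLS session are always records like this, which is why a JSON parser fed directly from the descriptor can never succeed.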

My first attempt to fix this failed, and it took a while to figure out why that approach was not viable. What I tried is equivalent to what the original poster in this thread tried... https://openssl-users.openssl.narkive.com/l4JsYKS8/ssl-peek-vs-ssl-pending

After the poll() returned and said there was incoming data, I tried calling SSL_has_pending() to see if there was ssl data, planning to use SCOPE_SSL_read() to handle it if there was. But... SSL_has_pending() never returned true, even when I could see that the data was tls stuff! After finding the above link and poring over it a few times, I came to understand that for the state of the ssl subsystem to work correctly, I can't interleave poll()s with SSL_ calls in this way. (poll and select are equivalent for the sake of understanding what's going on here.)

This reads to me a little like a zen poem, but taking it to heart and following its suggestion helped me get where I needed to be:

If one thinks they need to use select() on a blocking socket, 
  use non-blocking sockets instead. 
And only when non-blocking sockets are insufficient, 
  use select() 

After we establish a socket connection using an async (non-blocking) socket, we've always switched the socket to be blocking, then if desired, did the establishTlsConnection() stuff and all later stuff as a blocking socket. This zen poem made me think that I needed to switch things to be non-blocking to have a chance.
The first thing I tried after reading this was to stop switching the socket to be blocking. I think this could work, but it didn't for me because the SSL_connect() in establishTlsConnection() failed. Presumably I needed to do more work to handle errors due to the socket being non-blocking. Hmmm...

I then realized that for the sake of investigation at least, I could let the establishTlsConnection() run as a blocking socket as we always have, and then after the TLS connection was established, I could switch the socket to be non-blocking. This got me farther. At this point I have a tls connection working, and just need to figure out how to read the command/control data in a way that doesn't hose the stateful nature of our TLS subsystem.
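
Switching an already-connected socket between blocking and non-blocking is a matter of toggling O_NONBLOCK with fcntl(). A minimal sketch, assuming a plain POSIX descriptor; `set_nonblocking` is a hypothetical helper name, not a function in AppScope.

```c
#include <fcntl.h>

/* Hypothetical helper: turn O_NONBLOCK on (nonblocking != 0) or off
 * (nonblocking == 0) for fd, preserving the other status flags.
 * Returns 0 on success, -1 on failure (errno is set by fcntl). */
int
set_nonblocking(int fd, int nonblocking)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;

    if (nonblocking) {
        flags |= O_NONBLOCK;
    } else {
        flags &= ~O_NONBLOCK;
    }

    return (fcntl(fd, F_SETFL, flags) == -1) ? -1 : 0;
}
```

In the experiment described above, the equivalent of `set_nonblocking(fd, 1)` would run only after the blocking TLS handshake has finished.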

To do this in remoteConfig(), before we poll() we can now call SCOPE_SSL_read() because it won't block. I've seen it return no error, and I've seen it return SSL_ERROR_WANT_READ. If the error is SSL_ERROR_WANT_READ, it's now ok to poll in a way that won't hose up the TLS subsystem's state. Voilà! This is how it would be possible to support command/control over TLS. The recipe, if we're using tls, is to 1) use non-blocking sockets, and 2) be sure to call SCOPE_SSL_read() before calling poll().

@jrcheli
Contributor Author

jrcheli commented Mar 23, 2022

To capture the state of my experimental code before I commit anything, I captured diffs against version 3f90a84, which I'm attaching here. It's absolutely not ready for prime time, but might be something to refer to if my comments above are too cryptic or hard to follow...
exploring.patch.txt

To rehydrate this, git clone git@github.com:criblio/appscope.git, git checkout 3f90a849d6df43b98701e5f271d1b092df80fcd7, patch -p1 < exploring.patch.txt

@jrcheli
Contributor Author

jrcheli commented Mar 24, 2022

Ok. So the current plan is to support command and control on tcp and unix connections, with the limitation that we won't support tls right now.

With this agreement, I think the commit here on the feature/498-remote-config branch is ready for prime time and can now be merged if needed.

@iapaddler iapaddler changed the title Rework remoteConfig() to support command and control capabilities Epic: Rework remoteConfig() to support command and control capabilities Apr 6, 2022
@iapaddler iapaddler unpinned this issue May 2, 2022
@jrcheli
Contributor Author

jrcheli commented May 10, 2022

I originally saw an issue with tcpserver and tls, during the timeframe of the above comment (Mar 24, 2022).
It could be seen when the scoped process was started after tcpserver in tls mode. The good news is that after merging master into the feature/498-remote-config branch, I don't see this behavior anymore. I can start tcpserver first, or start the scoped process first, and have not seen that issue I originally observed. I don't know what the original cause was or what part of the merged code solves it, but I don't see a reason to go deeper now.

@jrcheli
Contributor Author

jrcheli commented May 11, 2022

The final set of testing I performed:
  0. Generate the keys used below

openssl req -nodes -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem

  1. Verify that tcpserver without tls can receive data from AppScope
<terminal1> LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false top
<terminal2> ./tcpserver 9109
  2. Verify that tcpserver without tls can send commands that generate a response from AppScope
<terminal1> LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false top
<terminal2> echo '{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }' > /tmp/cmdin
<terminal2> ./tcpserver 9109
<terminal2> U <enter>
  3. Verify that tcpserver with tls can receive data from AppScope
<terminal1> LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false SCOPE_EVENT_TLS_ENABLE=true SCOPE_EVENT_TLS_VALIDATE_SERVER=false SCOPE_EVENT_TLS_CA_CERT_PATH=./cert.pem top
<terminal2> ./tcpserver -t 9109
  4. Verify that tcpserver with tls can send commands, but that it won't interfere with the AppScope'd program (AppScope will not respond)
<terminal1> LD_PRELOAD=lib/linux/x86_64/libscope.so SCOPE_CRIBL_ENABLE=false SCOPE_EVENT_TLS_ENABLE=true SCOPE_EVENT_TLS_VALIDATE_SERVER=false SCOPE_EVENT_TLS_CA_CERT_PATH=./cert.pem top
<terminal2> ./tcpserver -t 9109
<terminal2> U <enter>

All tests passed.

This was referenced May 11, 2022
jrcheli added a commit that referenced this issue May 12, 2022
@jrcheli
Contributor Author

jrcheli commented May 12, 2022

During the review, @iapaddler asked whether we really want to do a ctlDisconnect() in the "no bueno" situation above... #498 (comment)

After talking, it does seem like receiving garbage on the ctl channel should not cause us to drop the connection. So I made one more change here... removed the ctlDisconnect() from remoteConfig().

To test this:
Before making this change, I stripped the trailing newline character off of the /tmp/cmdin file I had been using above with truncate -s -1 /tmp/cmdin, and observed that a scoped process would kill the connection in response to a "U" from tcpserver. I could observe the connection being killed by running the following watch command and seeing that the socket descriptor for port 9109 changed in response to the "U" command. (The library disconnects, then reconnects on a new socket descriptor.)

watch "lsof -p <pidof top>"

The output of tcpserver makes it clear the connection is getting dropped too:

{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }
server established connection on [2].4 with localhost (127.0.0.1:43556)
{"format":"ndjson","info":{"process":{"libscopever":"v1.1.0-tc0-25-g1bf647542925","pid":3014,"ppid":21968,"gid":1000,"groupname":"ubuntu","uid":1000,"username":"ubuntu","hostname":"ip-10-8-107-159","procname":"top","cmd":"top","id":"ip-10-8-107-159-top-top"},"configuration":{"current":{"metric":{"enable":"true","transport":{"type":"udp","host":"127.0.0.1","port":"8125","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"statsd","statsdprefix":"","statsdmaxlen":512,"verbosity":4},"watch":[{"type":"statsd"}]},"libscope":{"log":{"level":"warning","transport":{"type":"file","path":"/tmp/scope.log","buffering":"line"}},"configevent":"true","summaryperiod":10,"commanddir":"/tmp"},"event":{"enable":"true","transport":{"type":"tcp","host":"127.0.0.1","port":"9109","tls":{"enable":"false","validateserver":"true","cacertpath":""}},"format":{"type":"ndjson","maxeventpersec":10000,"enhancefs":"true"},"watch":[{"type":"file","name":"(\\/logs?\\/)|(\\.log$)|(\\.log[.\\d])","field":".*","value":".*"},{"type":"console","name":"(stdout)|(stderr)","field":".*","value":".*"},{"type":"http","name":".*","field":".*","value":".*","headers":[]},{"type":"net","name":".*","field":".*","value":".*"},{"type":"fs","name":".*","field":".*","value":".*"},{"type":"dns","name":".*","field":".*","value":".*"}]},"payload":{"enable":"false","dir":"/tmp"},"tags":{},"protocol":[]}},"environment":{}}}

After making this change, I ran the same test, and observed with a similar watch command that appscope did not kill the connection for port 9109. I could also see the difference from the output of tcpserver:

{ "type": "req", "req": "GetCfg", "reqId": 987413948756391 }
{"type":"info","body":"Error in receive from stream.  Scope receive retries exhausted."}

I'm considering this a successful test. ✅

3 participants