Bad DNS or Bad API tokens cause silent failure, infinite queue growth #4

Open
markstos opened this Issue May 21, 2015 · 4 comments


@markstos
Contributor

I created the below script as "t.js" in the project directory to test what happens with a Bad API token and also what happens when DNS fails or the LogEntries service is down. To simulate the latter, I added an entry to my local /etc/hosts that maps the DNS calls to "127.0.0.1". I noticed some undesirable things happen:

  1. Despite immediately registering an error handler, no error is generated to tell me that my token is bad.
  2. The bad DNS / failed connection doesn't generate an error either.
  3. Meanwhile, logging appears to be working and the entries are queued internally in a JavaScript array. I see no bound on the array's growth in the code, so while the library continues to fail silently, the Node process's memory would grow without bound until the process crashed, or until the OOM killer decided to kill some other process to save the OS itself from crashing.
  var LogentriesClient = require('./');

  // Here I am, giving you a bad token
  var logger = new LogentriesClient({
      token: 'BOOM'
  });

  // I immediately register an error handler
  logger.on('error', function (error) {
    console.log("Oops, there was an error logging. %s",error);
  });

  // Now I'm sending some logs which should never go through
  console.log("before warning");
  logger.warning('The kittens have become alarmingly adorable.');
  logger.warning('The kittens have become alarmingly adorable.');
  logger.warning('The kittens have become alarmingly adorable.');
  logger.warning('The kittens have become alarmingly adorable.');
  logger.warning('The kittens have become alarmingly adorable.');
  logger.warning('The kittens have become alarmingly adorable.');
  logger.warning('The kittens have become alarmingly adorable.');
  logger.warning('The kittens have become alarmingly adorable.');
  logger.warning('The kittens have become alarmingly adorable.');
  logger.warning('The kittens have become alarmingly adorable.');
  console.log("after warning");

  // To give async tasks the benefit of the doubt, we pause a bit before quitting.
  // By this point, I would expect an error to be emitted.
  setTimeout(function () {
    process.exit();
  },5000);

To check the queue growth, I also patched the code like this to log the queue size:

diff --git a/lib/logger.js b/lib/logger.js
index 7b9cb14..a3db71b 100644
--- a/lib/logger.js
+++ b/lib/logger.js
@@ -152,6 +152,8 @@ Object.defineProperties(Logger.prototype, {
                value: function(log) {
                        this._queue.push(log);

+      console.log("Current queue length is: "+this._queue.length);
+
                        if (this._socket == 'connecting') return;

                        if (this._state == 'connected') return this._processQueue();
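A possible mitigation for the unbounded growth noted in point 3 above would be to cap the queue and drop the oldest entries once the cap is hit. This is only a sketch around an illustrative `enqueue` helper, not the library's actual code:

  var MAX_QUEUE_LENGTH = 1000; // illustrative limit

  function enqueue(queue, log) {
    queue.push(log);
    if (queue.length > MAX_QUEUE_LENGTH) {
      // Drop the oldest entry so memory stays bounded while disconnected.
      // A 'drop' event could be emitted here so callers know data was lost.
      queue.shift();
    }
  }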
markstos referenced this issue in rapid7/le_node on May 21, 2015: wish: document failure modes #47 (closed)

@bathos
Owner
bathos commented May 21, 2015

Good catch. The logic pattern for the net connection and message queue was lifted from le_node without much examination. Reviewing it, I realize the whole thing ought to be a write stream in the first place, rather than reinventing the concept part way. Then the only queue would be the stream's native internal buffer, which has a high water mark, etc.
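As a rough sketch of that direction -- assuming a plain TCP transport, with all names here illustrative rather than the library's actual API -- the logger could subclass stream.Writable so the only buffering is the stream's own, bounded by highWaterMark:

  var net = require('net');
  var util = require('util');
  var Writable = require('stream').Writable;

  function LogStream(opts) {
    // highWaterMark bounds the internal buffer; write() returns false once it fills
    Writable.call(this, { highWaterMark: 16 * 1024 });
    this._socket = net.connect(opts.port, opts.host);
    // Surface connection errors on the stream so callers can handle them
    this._socket.on('error', this.emit.bind(this, 'error'));
  }
  util.inherits(LogStream, Writable);

  LogStream.prototype._write = function (chunk, encoding, callback) {
    // The callback fires once the chunk has been handed off to the socket
    this._socket.write(chunk, encoding, callback);
  };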

As for the TCP socket not passing error events to the Client object, that should be an easy fix. One thing I'm unclear on -- I don't know low-level networking well -- is whether there's any way for us to confirm that a given chunk was sent successfully. My impression is that there isn't: all you can really know is whether the connection is open, authorized, failed, or disconnected.
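For context on that last point (standard Node behaviour, not anything specific to this library): a TCP socket's write callback only confirms the chunk was flushed toward the kernel's send buffer, not that the remote end received it. HOST and PORT below are placeholders:

  var net = require('net');

  var HOST = '127.0.0.1'; // placeholder endpoint
  var PORT = 80;

  var socket = net.connect(PORT, HOST, function () {
    socket.write('example log line\n', function () {
      // "Written" from Node's point of view; delivery is still unconfirmed.
      socket.end();
    });
  });

  socket.on('error', function (err) {
    console.error('socket error: ' + err.message);
  });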

bathos self-assigned this May 21, 2015
@markstos
Contributor

Thanks for the fast response. I'm not sure how to check for a successful data write either. My plan would always be to log both locally and remotely, so I have two chances to succeed.
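A minimal sketch of that dual-logging idea, reusing the `logger` from the test script above and writing a local copy to a file (the file name and line format are arbitrary):

  var fs = require('fs');
  var localLog = fs.createWriteStream('app.log', { flags: 'a' });

  function warn(message) {
    // Local copy first, so a remote failure never loses the line entirely
    localLog.write(new Date().toISOString() + ' WARN ' + message + '\n');
    // Remote copy via the Logentries client
    logger.warning(message);
  }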

@bathos
Owner
bathos commented May 21, 2015

I'll begin a more thorough review of the library structure on an upcoming weekend. This is the main area where it's lacking, but it shouldn't be tough to improve -- I just want it to be solid. My plan is that the fixes for these issues -- and the refactoring they really call for -- will bring us to v1.0.0.

Feel free to open issues for any features or improvements you'd like to see, btw -- I'm in a vacuum over here.

@bathos
Owner
bathos commented May 24, 2015

I've begun the overhaul. A few things I've noticed so far:

Connection errors were already emitted on the client; I had just forgotten that those handlers get registered in a method on the client and not in logger-connection.js. So I'm not really clear on why your error handler was not invoked. An error event fires correctly for me when DNS is bad, so I have to ask -- is it possible that something on your machine was listening on local port 80? If so, there'd be no way to know it wasn't the correct endpoint. If not, perhaps you can tell me more about the test setup that produced this result.
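One quick way to check that hypothesis (port 80 here because that's the port in question above; adjust if needed): with the hosts entry pointing at 127.0.0.1, see whether a bare connection attempt is refused or accepted:

  var net = require('net');

  net.connect(80, '127.0.0.1')
    .on('connect', function () {
      console.log('something is listening locally -- the client would appear to succeed');
    })
    .on('error', function (err) {
      console.log('connection failed as expected: ' + err.code);
    });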

However, there were certainly problems with how the connection was handled. In the event of a timeout -- which for some reason I had set to an overlong three minutes -- we were retrying, but without a proper exponential backoff. Plus, no matter how many times it timed out, it would never have given up or emitted an event to indicate that there was a problem. This will be fixed. (Edit: I just remembered the timeout is for inactivity on the socket, not a connection timeout, so the three minutes was intentional.)
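For reference, a sketch of the kind of capped exponential backoff meant here -- `reconnect` is a hypothetical stand-in for whatever the client's reconnect routine ends up being:

  var RETRY_BASE_MS = 1000;
  var RETRY_MAX_MS = 60 * 1000;

  function retryDelay(attempt) {
    // 1s, 2s, 4s, ... capped at 60s
    return Math.min(RETRY_BASE_MS * Math.pow(2, attempt), RETRY_MAX_MS);
  }

  function scheduleReconnect(reconnect, attempt) {
    setTimeout(function () {
      reconnect(function (err) {
        if (err) return scheduleReconnect(reconnect, attempt + 1);
        // Connected: the caller can reset the attempt counter here.
      });
    }, retryDelay(attempt));
  }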

As for invalid tokens, I don't have good news. I've confirmed that there's never any response, not even if the data sent is garbage -- so the only way to discover that your token is bad is to notice that logs aren't coming through on Logentries. That isn't really acceptable, since it means you can't test whether your logging setup is actually working. I'm going to investigate whether there's some other data-returning endpoint we could use to confirm token validity before the first transmission attempt, but I think this may be a case where we need to open a Trello ticket for the backend guys.

In the meantime I've extended the initialization sanity checks to include a regex pattern test on the token; they have a rather specific format so it's at least possible to rule out nonsense strings.
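For illustration only -- the exact pattern in the library may differ, and `opts.token` stands in for whatever options object the constructor receives -- Logentries tokens look like UUIDs, so a coarse sanity check could be something like:

  var TOKEN_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

  if (!TOKEN_RE.test(opts.token)) {
    // Fail fast at construction time instead of silently sending nothing
    throw new Error('Token does not look like a valid Logentries token: ' + opts.token);
  }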
