Internet Simulator: Monitor and limit request payload sizes #18142

Merged
merged 4 commits on Oct 6, 2017

Conversation

@islemaster (Contributor) commented Oct 4, 2017

On Monday we saw brief downtime as server threads got tied up with very slow NetSimApi requests. After some investigation, we believe the root cause is a set of very large records in a single classroom simulation. This PR introduces metrics for NetSimApi request sizes and some simple limits on insert and update operations to make it less likely that this will happen again.

The Monday Incident

[Graph: New Relic chart of the NetSim Redis cache size during the incident]
The NetSim Redis cache normally stays below 50MB, because most simulation content is very compact (messages are base-64 encoded in flight and in storage) and the simulation aggressively cleans up after itself. On Monday, the cache climbed to more than 1.2GB in the space of about 10 minutes. New Relic did not identify an increased throughput rate during this period, so I think this represents a normal number of requests with oversized payloads.

The large cache size caused significant slowdown to all read requests for the affected simulation - and the records we've found show very slow requests to only one simulation - because it took a long time to get results from the Redis server back to the frontend serving the request. As available frontend threads got tied up with these long Redis read operations we saw a cascading effect leading to brief site downtime.

Mitigating the vulnerability

We've had some discussions internally about insulating our site infrastructure from NetSim, but there are also NetSim-specific steps we can take to make this less likely to happen again. As a first step in this PR I am implementing a 5MB payload limit on insert and update requests (we impose a similar limit on file uploads). I'm also logging some metrics to New Relic to allow us to monitor real payload sizes. It's likely 5MB is much larger than this limit needs to be, and the metrics will allow us to fine-tune that value for a realistic usage scenario. Still, it should be a little harder to fill up a single simulation with a size limit in place.

Other steps I'd like to take include:

  • Making sure redis-rb respects its 5-second timeout on requests to Redis (see the sketch after this list).
    This would hurt the affected simulation a bit sooner, but would make it much less likely that one bad simulation could affect the rest of the site.
  • Resetting a simulation that's grown to an unusable size.
  • Optimizing read operations to avoid the HGETALL command when possible.
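As a rough sketch of that first item, this is how the redis-rb client's timeouts might be configured so a slow read fails fast instead of tying up a frontend thread. The URL, constant name, and specific values below are illustrative, not the actual NetSim configuration:

```ruby
require 'redis'

# Hypothetical client setup; NETSIM_REDIS_URL and the constant name are placeholders.
NETSIM_REDIS = Redis.new(
  url: ENV['NETSIM_REDIS_URL'],
  connect_timeout: 1, # seconds allowed to open a connection
  read_timeout: 5,    # abandon reads that take longer than 5 seconds
  write_timeout: 5    # same bound for writes
)
```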

Instruments some custom New Relic metrics in NetSimApi:

- `Custom/NetSimApi/ResetShard` just for tracking how often this gets used.
- `Custom/NetSimApi/InsertBytes_*` for tracking size of insert requests (per table).
- `Custom/NetSimApi/InsertRows_*` for getting a sense of how multi-insert is being used (per table).
- `Custom/NetSimApi/UpdateBytes_*` for tracking size of update requests (per table).
Apply an initial request size limit of 5MB to insert and update operations in NetSimApi.  This is probably too high, but I don't want to lower it until we've gathered some metrics on normal request sizes.
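As a rough illustration (not the actual NetSimApi code), the size limit and the new per-table metrics might wrap an insert operation roughly like this. The route, JSON handling, and storage call are placeholders; MAX_REQUEST_SIZE, record_metric, too_large, and the metric names mirror names that appear in this PR:

```ruby
require 'sinatra'
require 'json'

# Sketch of a route in a Sinatra-style app.
post '/v3/netsim/:shard_id/:table_name' do |_shard_id, table_name|
  body_string = request.body.read

  if body_string.length > MAX_REQUEST_SIZE
    record_metric("InsertTooLarge_#{table_name}", body_string.length)
    return too_large # 413 Payload Too Large
  end

  rows = JSON.parse(body_string)
  rows = [rows] unless rows.is_a?(Array) # support single- and multi-insert payloads

  record_metric("InsertBytes_#{table_name}", body_string.length)
  record_metric("InsertRows_#{table_name}", rows.length)

  # ...perform the actual insert and return the created records...
end
```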
def record_metric(event_type, value = 1)
return unless CDO.newrelic_logging
NewRelic::Agent.record_metric("Custom/NetSimApi/#{event_type}", value)
end
Member commented:
Were you able to test new relic logging locally, and if so could you share what steps were necessary to get it working?

@islemaster (Contributor, Author) commented Oct 5, 2017

I haven't tested New Relic logging locally yet. I'm blocked on that right at this moment by their site downtime, but once they're back up, I think I've found the clue I need here. I think it will involve:

  1. Creating a local dashboard/config/newrelic.yml that sets monitor_mode true in development and has a valid New Relic key (maybe the development one Paul set up for logging build times).
  2. Running the application locally and generating some NetSim traffic.
  3. Checking New Relic to see if any metrics were recorded for the development application.

There's also a local audit log option in newrelic.yml that might be an alternative approach to testing - I'll have to experiment a bit.

@islemaster (Contributor, Author) commented Oct 5, 2017

Okay, I've been able to verify some of the metrics being sent via the audit log. These were my setup steps:

1. Add newrelic_logging: true to my locals.yml
Otherwise we skip newrelic metrics entirely in our application code.

2. Create dashboard/config/newrelic.yml
Making sure to enable the audit log and monitor_mode in development. Our Chef template newrelic.yml.erb is well-documented and was a big help. Here's mine:

common: &default_settings
  license_key: '<redacted>'
  monitor_mode: true
  developer_mode: false
  log_level: info
  audit_log:
    enabled: false
development:
  <<: *default_settings
  app_name: Brad Local Development
  monitor_mode: true
  developer_mode: false
  audit_log:
    enabled: true
    path: /home/brad/newrelic_audit.log

3. Start the local server and generate some traffic
I used http://localhost-studio.code.org:3000/s/netsim/stage/5/puzzle/1 and made sure to send messages of different sizes (InsertRows and InsertBytes metrics), reset the shard a few times (ResetShard metrics), connect and disconnect from various rooms (to check tables other than m), and adjust some router settings on another level (UpdateBytes metrics). Unlike events, metrics get aggregated and sent on intervals by the New Relic agent, so if you want to capture distinct behaviors, make sure you separate them by at least 30 seconds or so of idle time.

4. Check the audit log
It's very verbose, so grepping for your metrics in particular is highly recommended. Here are some local results from my change:

╰○ egrep -o "Custom/NetSimApi/[^\"]+\",\"scope\":\"\"},\[[^]]+\]" ~/newrelic_audit.log
Custom/NetSimApi/InsertBytes_n","scope":""},[2,198.0,198.0,38.0,160.0,27044.0]
Custom/NetSimApi/InsertRows_n","scope":""},[2,2.0,2.0,1.0,1.0,2.0]
Custom/NetSimApi/InsertBytes_w","scope":""},[1,129.0,129.0,129.0,129.0,16641.0]
Custom/NetSimApi/InsertRows_w","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[2,338.0,338.0,168.0,170.0,57124.0]
Custom/NetSimApi/InsertRows_m","scope":""},[2,2.0,2.0,1.0,1.0,2.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[1,159.0,159.0,159.0,159.0,25281.0]
Custom/NetSimApi/InsertRows_l","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[6,1012.0,1012.0,143.0,194.0,173098.0]
Custom/NetSimApi/InsertRows_m","scope":""},[6,6.0,6.0,1.0,1.0,6.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[3,476.0,476.0,134.0,183.0,76726.0]
Custom/NetSimApi/InsertRows_l","scope":""},[3,3.0,3.0,1.0,1.0,3.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[28,23940.0,23940.0,164.0,971.0,22690528.0]
Custom/NetSimApi/InsertRows_m","scope":""},[28,28.0,28.0,1.0,1.0,28.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[14,11830.0,11830.0,155.0,960.0,11107250.0]
Custom/NetSimApi/InsertRows_l","scope":""},[14,14.0,14.0,1.0,1.0,14.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[64,8686.0,8686.0,126.0,137.0,1179074.0]
Custom/NetSimApi/InsertRows_m","scope":""},[64,64.0,64.0,1.0,1.0,64.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[32,4023.0,4023.0,117.0,126.0,505845.0]
Custom/NetSimApi/InsertRows_l","scope":""},[32,32.0,32.0,1.0,1.0,32.0]
Custom/NetSimApi/InsertBytes_n","scope":""},[2,194.0,194.0,38.0,156.0,25780.0]
Custom/NetSimApi/InsertRows_n","scope":""},[2,2.0,2.0,1.0,1.0,2.0]
Custom/NetSimApi/InsertBytes_w","scope":""},[1,136.0,136.0,136.0,136.0,18496.0]
Custom/NetSimApi/InsertRows_w","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/UpdateBytes_n","scope":""},[12,1828.0,1828.0,149.0,157.0,278560.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[1,139.0,139.0,139.0,139.0,19321.0]
Custom/NetSimApi/InsertRows_m","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[1,130.0,130.0,130.0,130.0,16900.0]
Custom/NetSimApi/InsertRows_l","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[9,1239.0,1239.0,135.0,140.0,170605.0]
Custom/NetSimApi/InsertRows_m","scope":""},[9,9.0,9.0,1.0,1.0,9.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[5,642.0,642.0,126.0,130.0,82452.0]
Custom/NetSimApi/InsertRows_l","scope":""},[5,5.0,5.0,1.0,1.0,5.0]
Custom/NetSimApi/UpdateBytes_n","scope":""},[2,305.0,305.0,151.0,154.0,46517.0]
Custom/NetSimApi/InsertBytes_n","scope":""},[12,1170.0,1170.0,35.0,160.0,160950.0]
Custom/NetSimApi/InsertRows_n","scope":""},[12,12.0,12.0,1.0,1.0,12.0]
Custom/NetSimApi/ResetShard","scope":""},[4,4.0,4.0,1.0,1.0,4.0]
Custom/NetSimApi/InsertBytes_w","scope":""},[1,127.0,127.0,127.0,127.0,16129.0]
Custom/NetSimApi/InsertRows_w","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[2,330.0,330.0,164.0,166.0,54452.0]
Custom/NetSimApi/InsertRows_m","scope":""},[2,2.0,2.0,1.0,1.0,2.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[1,152.0,152.0,152.0,152.0,23104.0]
Custom/NetSimApi/InsertRows_l","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[16,14748.0,14748.0,200.0,1499.0,18327268.0]
Custom/NetSimApi/InsertRows_m","scope":""},[16,16.0,16.0,1.0,1.0,16.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[8,7270.0,7270.0,188.0,1485.0,8973254.0]
Custom/NetSimApi/InsertRows_l","scope":""},[8,8.0,8.0,1.0,1.0,8.0]
Custom/NetSimApi/ResetShard","scope":""},[1,1.0,1.0,1.0,1.0,1.0]

I'm not sure of the exact meanings of the array of values, but they appear to be [count, total, total, min, max, sum_of_squares]. The values match the ranges I'd expect for the various events; for instance, the single-row InsertBytes_w sample reports 129 bytes with 16641 (which is 129²) in the last field.

5. Check New Relic's Dashboard
My local machine is definitely reporting to New Relic - it shows up as an application in the APM tool:
[Screenshot: the local application appearing in New Relic's APM application list]

However, I haven't been able to find any of my new metrics in the metrics explorer. I'm wondering if there's a delay before that data is populated, if their recent site issues are affecting anything, or if there's just another configuration setting I've missed. In any case, the audit log makes me confident that we'll get these metrics all the way through the system once this change is in prod.

@Hamms (Contributor) left a comment

What does hitting this limit look like to a user?

if body_string.length > MAX_REQUEST_SIZE
record_metric("UpdateTooLarge_#{table_name}", body_string.length)
too_large
end
Member commented:

This approach may turn out to be hard to monitor. The easiest two ways I know of to monitor New Relic metrics are via custom dashboards, and the problem is that both prompt you to specify the custom metric by name; if your metric name varies, then I'm not sure how you'll find it. Some possible solutions:

  1. make this a custom event instead of a custom metric, which lets you include as many fields as you want and can be searched using the events explorer (though it may not keep data as long); see the sketch after this list
  2. use the metrics explorer and rely on the autocomplete there to show which metrics have values (e.g. the metrics explorer for Custom/FilesApi)
  3. also increment a counter that shows the total number of InsertTooLarge and UpdateTooLarge events, add that to a dashboard, and only drill down when that number goes above zero (also sketched below)
  4. if you know all the table_names and there aren't too many, you could add each #{metric_name}_#{table_name} combination to a single dashboard
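As a rough sketch of options 1 and 3, using calls the New Relic Ruby agent provides; the event name, attributes, and aggregate counter below are illustrative, not the names this PR actually adds:

```ruby
if body_string.length > MAX_REQUEST_SIZE
  # Option 1: a custom event carries the table name as an attribute, so one
  # event type can be filtered in the events explorer instead of hunting for
  # many metric names.
  NewRelic::Agent.record_custom_event('NetSimRequestTooLarge', {
    table: table_name,
    operation: 'update',
    bytes: body_string.length
  })

  # Option 3: also bump a single aggregate counter that is easy to pin to a
  # dashboard; drill into the per-table metrics only when it goes above zero.
  NewRelic::Agent.record_metric('Custom/NetSimApi/RequestTooLarge', 1)
end
```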

@islemaster (Contributor, Author) commented:

I messed with the metrics explorer for a bit to understand how we'd eventually use this data, and it seemed usable re: autocomplete and exploring what's been logged. In any case, there are exactly four table names in NetSim:

  • n for Nodes
  • w for Wires
  • m for Messages
  • l for Logs

I'm assuming I'll set up a dashboard that monitors each metric across each of these four tables, and I figured the granularity would help us grok bad behavior if it happens again - for example, I expect the message table to be the source of most of our problems, but if we start seeing huge requests to the node or wire tables, that suggests a bad actor hitting our API without using the NetSim client code.

We might find it useful to implement stricter request size limits on some tables than others. If we eventually feel like we've got this locked down, maybe we'll remove the table names and consolidate these metrics.
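If the metrics do point that way, per-table limits could be as simple as a lookup with a default. The constant names and byte values here are purely hypothetical; real values would be tuned from the new metrics:

```ruby
# Hypothetical per-table request size limits, in bytes.
TABLE_MAX_REQUEST_SIZE = {
  'n' => 100_000,    # nodes
  'w' => 100_000,    # wires
  'm' => 5_000_000,  # messages
  'l' => 1_000_000   # logs
}.freeze
DEFAULT_MAX_REQUEST_SIZE = 5_000_000

def max_request_size_for(table_name)
  TABLE_MAX_REQUEST_SIZE.fetch(table_name, DEFAULT_MAX_REQUEST_SIZE)
end
```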

@islemaster (Contributor, Author) commented:

So in total I'm introducing 21 new metrics:

  • (1) Custom/NetSimApi/ResetShard
  • (4) Custom/NetSimApi/InsertBytes_(n|w|m|l)
  • (4) Custom/NetSimApi/InsertRows_(n|w|m|l)
  • (4) Custom/NetSimApi/UpdateBytes_(n|w|m|l)
  • (4) Custom/NetSimApi/InsertTooLarge_(n|w|m|l)
  • (4) Custom/NetSimApi/UpdateTooLarge_(n|w|m|l)

Member commented:

Ahh ok, the parameterized custom metric names make great sense then. Those should all fit nicely in a dashboard or two.

@davidsbailey (Member) left a comment

LGTM after considering how you plan to monitor the custom metrics with variable names.

@islemaster (Contributor, Author) commented:

User experience when hitting one of these errors is that the "Send" box clears, and the following toast appears:
[Screenshot: the toast message shown in the simulator]
Which is, of course, not good advice in this case, but all told it's a pretty graceful failure and the user can go back to using the simulation normally.

A client inspecting network traffic would see the 413 responses to their large POST requests with fairly unambiguous messaging:
[Screenshots: 413 responses to the oversized POST requests, as seen in the browser's network inspector]

Tested by tuning the request body size limit down from 5MB to 500B and then sending 5000+ character messages in the simulator.
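For reference, a minimal sketch of how a too_large helper might produce that 413 in a Sinatra-style app; the actual NetSimApi helper and message text may differ:

```ruby
# Defined inside the app's helpers block in a Sinatra-style application.
def too_large
  # halt short-circuits the request with the given status, headers, and body.
  halt 413, { 'Content-Type' => 'text/plain' },
       "Payload too large: request bodies are limited to #{MAX_REQUEST_SIZE} bytes."
end
```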

@davidsbailey (Member) commented:

I've sometimes seen a 30-60 second delay in metrics and events showing up, but never more.

There is also the separate New Relic "account" we have configured for development: https://rpm.newrelic.com/accounts/501463/applications/17634241. It might be possible to get your events to show up there; I believe you have to use a different API key.

The other thing I've done is patch my changes into production-console, fire up bin/production-console, and run commands there to trigger the events. This is not ideal, though.

@davidsbailey (Member) commented:

Hey Brad, it sounds like you're a step ahead of me and already got stuff showing up on the New Relic site. If you've gotten that far, then I'm not sure my last suggestions will help. Strange that it would show up in one part of the site but not the other.

@islemaster (Contributor, Author) commented:

Finally got metrics to show up in Insights. Not sure what did it - I saw something in the docs saying you might need to have your agent running for up to ten minutes for initial metrics to get collected, and a lot of other stuff saying the Metrics Explorer autocomplete just isn't that great. Anyway, I was eventually able to generate some results:

[Screenshot: the new Custom/NetSimApi metrics appearing in the New Relic metrics explorer]

We'll be able to assemble these into a basic dashboard and learn something about our NetSim traffic.
