Internet Simulator: Monitor and limit request payload sizes #18142

Merged
merged 4 commits on Oct 6, 2017

Conversation

@islemaster (Contributor) commented Oct 4, 2017

On Monday we saw brief downtime as server threads got tied up with very slow NetSimApi requests. After some investigation, we believe the root cause is a set of very large records in a single classroom simulation. This PR introduces metrics for NetSimApi request sizes and some simple limits on insert and update operations to make it less likely that this will happen again.

The Monday Incident

[Graph: New Relic chart of the NetSim Redis cache size during the incident]
The NetSim Redis cache normally stays below 50MB, because most simulation content is very compact (messages are base-64 encoded in flight and in storage) and the simulation aggressively cleans up after itself. On Monday, the cache climbed to more than 1.2GB in the space of about 10 minutes. New Relic did not identify an increased throughput rate during this period, so I think this represents a normal number of requests with oversized payloads.

The large cache size caused significant slowdown to all read requests for the affected simulation - and the records we've found show very slow requests to only one simulation - because it took a long time to get results from the Redis server back to the frontend serving the request. As available frontend threads got tied up with these long Redis read operations we saw a cascading effect leading to brief site downtime.

Mitigating the vulnerability

We've had some discussions internally about insulating our site infrastructure from NetSim, but there are also NetSim-specific steps we can take to make this less likely to happen again. As a first step in this PR I am implementing a 5MB payload limit on insert and update requests (we impose a similar limit on file uploads). I'm also logging some metrics to New Relic to allow us to monitor real payload sizes. It's likely 5MB is much larger than this limit needs to be, and the metrics will allow us to fine-tune that value for a realistic usage scenario. Still, it should be a little harder to fill up a single simulation with a size limit in place.

Other steps I'd like to take include:

  • Making sure redis-rb respects its 5-second timeout on requests to Redis (see the sketch after this list).
    This would hurt the affected simulation a bit sooner, but would make it much less likely that one bad simulation could affect the rest of the site.
  • Resetting a simulation that's grown to an unusable size.
  • Optimizing read operations to avoid the HGETALL command when possible.
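As a rough sketch of that first item, this is how the redis-rb client's timeouts might be configured so a slow read fails fast instead of tying up a frontend thread. The URL, constant name, and specific values below are illustrative, not the actual NetSim configuration:

```ruby
require 'redis'

# Hypothetical client setup; NETSIM_REDIS_URL and the constant name are placeholders.
NETSIM_REDIS = Redis.new(
  url: ENV['NETSIM_REDIS_URL'],
  connect_timeout: 1, # seconds allowed to open a connection
  read_timeout: 5,    # abandon reads that take longer than 5 seconds
  write_timeout: 5    # same bound for writes
)
```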

Instruments some custom New Relic metrics in NetSimApi:

- `Custom/NetSimApi/ResetShard` just for tracking how often this gets used.
- `Custom/NetSimApi/InsertBytes_*` for tracking size of insert requests (per table).
- `Custom/NetSimApi/InsertRows_*` for getting a sense of how multi-insert is being used (per table).
- `Custom/NetSimApi/UpdateBytes_*` for tracking size of update requests (per table).
Apply an initial request size limit of 5MB to insert and update operations in NetSimApi.  This is probably too high, but I don't want to lower it until we've gathered some metrics on normal request sizes.
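As a rough illustration (not the actual NetSimApi code), the size limit and the new per-table metrics might wrap an insert operation roughly like this. The route, JSON handling, and storage call are placeholders; MAX_REQUEST_SIZE, record_metric, too_large, and the metric names mirror names that appear in this PR:

```ruby
require 'sinatra'
require 'json'

# Sketch of a route in a Sinatra-style app.
post '/v3/netsim/:shard_id/:table_name' do |_shard_id, table_name|
  body_string = request.body.read

  if body_string.length > MAX_REQUEST_SIZE
    record_metric("InsertTooLarge_#{table_name}", body_string.length)
    return too_large # 413 Payload Too Large
  end

  rows = JSON.parse(body_string)
  rows = [rows] unless rows.is_a?(Array) # support single- and multi-insert payloads

  record_metric("InsertBytes_#{table_name}", body_string.length)
  record_metric("InsertRows_#{table_name}", rows.length)

  # ...perform the actual insert and return the created records...
end
```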
def record_metric(event_type, value = 1)
return unless CDO.newrelic_logging
NewRelic::Agent.record_metric("Custom/NetSimApi/#{event_type}", value)
end
Member commented:
Were you able to test new relic logging locally, and if so could you share what steps were necessary to get it working?

@islemaster (Contributor, Author) commented Oct 5, 2017

I haven't tested New Relic logging locally yet. I'm blocked on that right at this moment by their site downtime, but once they're back up, I think I've found the clue I need here. I think it will involve:

  1. Creating a local dashboard/config/newrelic.yml that sets monitor_mode true in development and has a valid New Relic key (maybe the development one Paul set up for logging build times).
  2. Running the application locally and generating some NetSim traffic.
  3. Checking New Relic to see if any metrics were recorded for the development application.

There's also a local audit log option in newrelic.yml that might be an alternative approach to testing - I'll have to experiment a bit.

@islemaster (Contributor, Author) commented Oct 5, 2017

Okay, I've been able to verify some of the metrics being sent via the audit log. These were my setup steps:

1. Add newrelic_logging: true to my locals.yml
Otherwise we skip newrelic metrics entirely in our application code.

2. Create dashboard/config/newrelic.yml
Making sure to enable the audit log and monitor_mode in development. Our Chef template newrelic.yml.erb is well-documented and was a big help. Here's mine:

common: &default_settings
  license_key: '<redacted>'
  monitor_mode: true
  developer_mode: false
  log_level: info
  audit_log:
    enabled: false
development:
  <<: *default_settings
  app_name: Brad Local Development
  monitor_mode: true
  developer_mode: false
  audit_log:
    enabled: true
    path: /home/brad/newrelic_audit.log

3. Start the local server and generate some traffic
I used http://localhost-studio.code.org:3000/s/netsim/stage/5/puzzle/1 and made sure to send messages of different sizes (InsertRows and InsertBytes metrics), reset the shard a few times (ResetShard metrics), connect and disconnect from various rooms (to check tables other than m), and adjust some router settings on another level (UpdateBytes metrics). Unlike events, metrics get aggregated and sent on intervals by the New Relic agent, so if you want to capture distinct behaviors, make sure you separate them by at least 30 seconds or so of idle time.

4. Check the audit log
It's very verbose, so grepping for your metrics in particular is highly recommended. Here are some local results from my change:

╰○ egrep -o "Custom/NetSimApi/[^\"]+\",\"scope\":\"\"},\[[^]]+\]" ~/newrelic_audit.log
Custom/NetSimApi/InsertBytes_n","scope":""},[2,198.0,198.0,38.0,160.0,27044.0]
Custom/NetSimApi/InsertRows_n","scope":""},[2,2.0,2.0,1.0,1.0,2.0]
Custom/NetSimApi/InsertBytes_w","scope":""},[1,129.0,129.0,129.0,129.0,16641.0]
Custom/NetSimApi/InsertRows_w","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[2,338.0,338.0,168.0,170.0,57124.0]
Custom/NetSimApi/InsertRows_m","scope":""},[2,2.0,2.0,1.0,1.0,2.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[1,159.0,159.0,159.0,159.0,25281.0]
Custom/NetSimApi/InsertRows_l","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[6,1012.0,1012.0,143.0,194.0,173098.0]
Custom/NetSimApi/InsertRows_m","scope":""},[6,6.0,6.0,1.0,1.0,6.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[3,476.0,476.0,134.0,183.0,76726.0]
Custom/NetSimApi/InsertRows_l","scope":""},[3,3.0,3.0,1.0,1.0,3.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[28,23940.0,23940.0,164.0,971.0,22690528.0]
Custom/NetSimApi/InsertRows_m","scope":""},[28,28.0,28.0,1.0,1.0,28.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[14,11830.0,11830.0,155.0,960.0,11107250.0]
Custom/NetSimApi/InsertRows_l","scope":""},[14,14.0,14.0,1.0,1.0,14.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[64,8686.0,8686.0,126.0,137.0,1179074.0]
Custom/NetSimApi/InsertRows_m","scope":""},[64,64.0,64.0,1.0,1.0,64.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[32,4023.0,4023.0,117.0,126.0,505845.0]
Custom/NetSimApi/InsertRows_l","scope":""},[32,32.0,32.0,1.0,1.0,32.0]
Custom/NetSimApi/InsertBytes_n","scope":""},[2,194.0,194.0,38.0,156.0,25780.0]
Custom/NetSimApi/InsertRows_n","scope":""},[2,2.0,2.0,1.0,1.0,2.0]
Custom/NetSimApi/InsertBytes_w","scope":""},[1,136.0,136.0,136.0,136.0,18496.0]
Custom/NetSimApi/InsertRows_w","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/UpdateBytes_n","scope":""},[12,1828.0,1828.0,149.0,157.0,278560.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[1,139.0,139.0,139.0,139.0,19321.0]
Custom/NetSimApi/InsertRows_m","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[1,130.0,130.0,130.0,130.0,16900.0]
Custom/NetSimApi/InsertRows_l","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[9,1239.0,1239.0,135.0,140.0,170605.0]
Custom/NetSimApi/InsertRows_m","scope":""},[9,9.0,9.0,1.0,1.0,9.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[5,642.0,642.0,126.0,130.0,82452.0]
Custom/NetSimApi/InsertRows_l","scope":""},[5,5.0,5.0,1.0,1.0,5.0]
Custom/NetSimApi/UpdateBytes_n","scope":""},[2,305.0,305.0,151.0,154.0,46517.0]
Custom/NetSimApi/InsertBytes_n","scope":""},[12,1170.0,1170.0,35.0,160.0,160950.0]
Custom/NetSimApi/InsertRows_n","scope":""},[12,12.0,12.0,1.0,1.0,12.0]
Custom/NetSimApi/ResetShard","scope":""},[4,4.0,4.0,1.0,1.0,4.0]
Custom/NetSimApi/InsertBytes_w","scope":""},[1,127.0,127.0,127.0,127.0,16129.0]
Custom/NetSimApi/InsertRows_w","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[2,330.0,330.0,164.0,166.0,54452.0]
Custom/NetSimApi/InsertRows_m","scope":""},[2,2.0,2.0,1.0,1.0,2.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[1,152.0,152.0,152.0,152.0,23104.0]
Custom/NetSimApi/InsertRows_l","scope":""},[1,1.0,1.0,1.0,1.0,1.0]
Custom/NetSimApi/InsertBytes_m","scope":""},[16,14748.0,14748.0,200.0,1499.0,18327268.0]
Custom/NetSimApi/InsertRows_m","scope":""},[16,16.0,16.0,1.0,1.0,16.0]
Custom/NetSimApi/InsertBytes_l","scope":""},[8,7270.0,7270.0,188.0,1485.0,8973254.0]
Custom/NetSimApi/InsertRows_l","scope":""},[8,8.0,8.0,1.0,1.0,8.0]
Custom/NetSimApi/ResetShard","scope":""},[1,1.0,1.0,1.0,1.0,1.0]

I'm not sure of the exact meanings of the array of values, but they appear to be [count, total, total, min, max, sum_of_squares]. The values match the ranges I'd expect for the various events; for instance, the single-row InsertBytes_w sample reports 129 bytes with 16641 (which is 129²) in the last field.

5. Check New Relic's Dashboard
My local machine is definitely reporting to New Relic - it shows up as an application in the APM tool:
[Screenshot: the local application appearing in New Relic's APM application list]

However, I haven't been able to find any of my new metrics in the metrics explorer. I'm wondering if there's a delay before that data is populated, if their recent site issues are affecting anything, or if there's just another configuration setting I've missed. In any case, the audit log makes me confident that we'll get these metrics all the way through the system once this change is in prod.

@Hamms (Contributor) left a comment

What does hitting this limit look like to a user?

if body_string.length > MAX_REQUEST_SIZE
record_metric("UpdateTooLarge_#{table_name}", body_string.length)
too_large
end
Member commented:

This approach may turn out to be hard to monitor. The easiest two ways I know of to monitor New Relic metrics are via custom dashboards, and the problem is that both prompt you to specify the custom metric by name; if your metric name varies, then I'm not sure how you'll find it. Some possible solutions:

  1. make this a custom event instead of a custom metric, which lets you include as many fields as you want and can be searched using the events explorer (though it may not keep data as long); see the sketch after this list
  2. use the metrics explorer and rely on the autocomplete there to show which metrics have values (e.g. the metrics explorer for Custom/FilesApi)
  3. also increment a counter that shows the total number of InsertTooLarge and UpdateTooLarge events, add that to a dashboard, and only drill down when that number goes above zero (also sketched below)
  4. if you know all the table_names and there aren't too many, you could add each #{metric_name}_#{table_name} combination to a single dashboard
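As a rough sketch of options 1 and 3, using calls the New Relic Ruby agent provides; the event name, attributes, and aggregate counter below are illustrative, not the names this PR actually adds:

```ruby
if body_string.length > MAX_REQUEST_SIZE
  # Option 1: a custom event carries the table name as an attribute, so one
  # event type can be filtered in the events explorer instead of hunting for
  # many metric names.
  NewRelic::Agent.record_custom_event('NetSimRequestTooLarge', {
    table: table_name,
    operation: 'update',
    bytes: body_string.length
  })

  # Option 3: also bump a single aggregate counter that is easy to pin to a
  # dashboard; drill into the per-table metrics only when it goes above zero.
  NewRelic::Agent.record_metric('Custom/NetSimApi/RequestTooLarge', 1)
end
```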

@islemaster (Contributor, Author) commented:

I messed with the metrics explorer for a bit to understand how we'd eventually use this data, and it seemed usable re: autocomplete and exploring what's been logged. In any case, there are exactly four table names in NetSim:

  • n for Nodes
  • w for Wires
  • m for Messages
  • l for Logs

I'm assuming I'll set up a dashboard that monitors each metric across each of these four tables, and I figured the granularity would help us grok bad behavior if it happens again - for example, I expect the message table to be the source of most of our problems, but if we start seeing huge requests to the node or wire tables, that suggests a bad actor hitting our API without using the NetSim client code.

We might find it useful to implement stricter request size limits on some tables than others. If we eventually feel like we've got this locked down, maybe we'll remove the table names and consolidate these metrics.
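If the metrics do point that way, per-table limits could be as simple as a lookup with a default. The constant names and byte values here are purely hypothetical; real values would be tuned from the new metrics:

```ruby
# Hypothetical per-table request size limits, in bytes.
TABLE_MAX_REQUEST_SIZE = {
  'n' => 100_000,    # nodes
  'w' => 100_000,    # wires
  'm' => 5_000_000,  # messages
  'l' => 1_000_000   # logs
}.freeze
DEFAULT_MAX_REQUEST_SIZE = 5_000_000

def max_request_size_for(table_name)
  TABLE_MAX_REQUEST_SIZE.fetch(table_name, DEFAULT_MAX_REQUEST_SIZE)
end
```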

@islemaster (Contributor, Author) commented:

So in total I'm introducing 21 new metrics:

  • (1) Custom/NetSimApi/ResetShard
  • (4) Custom/NetSimApi/InsertBytes_(n|w|m|l)
  • (4) Custom/NetSimApi/InsertRows_(n|w|m|l)
  • (4) Custom/NetSimApi/UpdateBytes_(n|w|m|l)
  • (4) Custom/NetSimApi/InsertTooLarge_(n|w|m|l)
  • (4) Custom/NetSimApi/UpdateTooLarge_(n|w|m|l)

Member commented:

Ahh ok, the parameterized custom metric names make great sense then. Those should all fit nicely in a dashboard or two.

@davidsbailey (Member) left a comment

LGTM after considering how you plan to monitor the custom metrics with variable names.

@islemaster (Contributor, Author) commented:

User experience when hitting one of these errors is that the "Send" box clears, and the following toast appears:
[Screenshot: the toast message shown in the simulator]
Which is, of course, not good advice in this case, but all told it's a pretty graceful failure and the user can go back to using the simulation normally.

A client inspecting network traffic would see the 413 responses to their large POST requests with fairly unambiguous messaging:
[Screenshots: 413 responses to the oversized POST requests, as seen in the browser's network inspector]

Tested by tuning the request body size limit down from 5MB to 500B and then sending 5000+ character messages in the simulator.
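For reference, a minimal sketch of how a too_large helper might produce that 413 in a Sinatra-style app; the actual NetSimApi helper and message text may differ:

```ruby
# Defined inside the app's helpers block in a Sinatra-style application.
def too_large
  # halt short-circuits the request with the given status, headers, and body.
  halt 413, { 'Content-Type' => 'text/plain' },
       "Payload too large: request bodies are limited to #{MAX_REQUEST_SIZE} bytes."
end
```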

@davidsbailey (Member) commented:

I've sometimes seen a 30-60 second delay in metrics and events showing up, but never more.

There is also the separate New Relic "account" we have configured for development: https://rpm.newrelic.com/accounts/501463/applications/17634241. It might be possible to get your events to show up there; I believe you have to use a different API key.

The other thing I've done is patch my changes into production-console, fire up bin/production-console, and run commands there to trigger the events. This is not ideal, though.

@davidsbailey (Member) commented:

Hey Brad, it sounds like you're a step ahead of me and already got stuff showing up on the New Relic site. If you've gotten that far, then I'm not sure my last suggestions will help. Strange that it would show up in one part of the site but not the other.

@islemaster (Contributor, Author) commented:

Finally got metrics to show up in Insights. Not sure what did it - I saw something in the docs saying you might need to have your agent running for up to ten minutes for initial metrics to get collected, and a lot of other stuff saying the Metrics Explorer autocomplete just isn't that great. Anyway, I was eventually able to generate some results:

[Screenshot: the new Custom/NetSimApi metrics appearing in the New Relic metrics explorer]

We'll be able to assemble these into a basic dashboard and learn something about our NetSim traffic.
