Speed up v3 security groups #2255

andy-paine · 2021-05-04T15:03:12Z

Summary

For users with large numbers of space visibilities (1000+), the query string in SecurityGroupPresenter::space_guid_hash_for was getting extremely large as it does a SELECT ... WHERE IN (<list of visible space guids>) for each security group. As the goal of this query is to effectively find which spaces the security group is applied to, the number of returned rows is strictly less than or equal to the size of the visible space guid lists. By fetching the list of all space guids a security group is applied to and filtering in Ruby, we avoid having to serialize the large visible space list and create + send the large query (sometimes multiple MB in length!) across to the DB.

Best case scenario (for this change)

Many spaces cause the old query string to be very large
Most spaces do not have a security group applied (not many rows in security_groups_spaces) so few rows are returned

This should be a big improvement as the extra cost of doing include? in the Cloud Controller should be neglible when only a few rows are returned

Using:

100k orgs
100k spaces
40k security groups
20k security group spaces

Timings:

Before: cf curl /v3/security_groups - 68s
After: cf curl /v3/security_groups - 0.37s
Before: cf curl /v3/security_groups?per_page=500 - Timed out after 30 minutes (watching the DB it was doing some weird stuff)
After: cf curl /v3/security_groups?per_page=500 - 1.74s

Worst case scenario (for this change)

Fewer spaces so query string not as large
Every space has every security group applied (many rows in security_groups_spaces) so lots of rows are returned

This could be a significant downgrade if the include? is expensive due to lots of rows being returned as the filtering is no longer done in the DB

Using:

2k orgs
2k spaces
500 security groups
1 million security group spaces (one for every space x security group pair)

NOTE: timings aren't comparable to above as there is significantly less overall data otherwise the measurements would have to be in hours as there would be 4 billion security_groups_spaces rows!

Timings:

Before: cf curl /v3/security_groups - 8.1s
After: cf curl /v3/security_groups - 4.6s
Before: cf curl /v3/security_groups?per_page=500 - 96s
After: cf curl /v3/security_groups?per_page=500 - 48s

Initially there appeared to be no significant difference between approaches however by including a select(:guid) in the query, this reduced the size of the returned dataset enough that the new query outperformed the old one even in this scenario.

Conclusion

This adjusted query seems to be an improvement in both the worst and best case scenario for this change. I would also suggest that the "best case scenario" above is much more realistic for a real world scenario of having many spaces but only a few security groups per space.

I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the main branch
I have run all the unit tests using bundle exec rake
I have run CF Acceptance Tests

For users with access to a large number of spaces, the SQL query string for checking which spaces a security group applies to gets quite big (sometimes >1MB). By grabbing all the spaces a security group applies to and doing the filtering in Ruby, this query string is much smaller. The potential extra cost of running the `include?` seems to be offset by the saving in creating/sending the large query string and using the `select(:guid)` also saves some data transfer to further improve the latency of this query.

sethboyles · 2021-05-05T22:51:19Z

Thanks for the PR! This all makes sense and I will merge shortly.

Testing out the latter (worst) case on our environment, we saw a slight decrease in response time, from ~2m28s to ~2m9s

FWIW I experimented a little with the latter case, replacing dataset.select(:guid).all with dataset.select_map(:guid), the theory being that select_map avoids constructing a Ruby object per Space record, and instead simply returns an array, but only saw a minimal decrease in response time (from 2m9s -> 2m4s)

Additionally, I tried using eager loading, which batches all 2000 spaces into one query. Later in the presenter would have to be changed to use the already loaded array of Space records rather than a dataset. The queries themselves would execute quickly, but the Sequel ORM still takes a long time to stitch the 1 million associations together. The request ended up being longer than 3 minutes overall. I think eager loading might have had potential for your first case scenario, but given the improvement already shown I didn't try it out.

cf-gitbot added the unscheduled label May 4, 2021

andy-paine force-pushed the speed-up-v3-security-groups branch from 51c8d33 to 0a7353a Compare May 4, 2021 15:07

sethboyles merged commit 0c1374d into cloudfoundry:main May 5, 2021

andy-paine mentioned this pull request Jun 24, 2021

Replace readable_org_guids_for_domains with query #2353

Merged

5 tasks

This was referenced Aug 25, 2021

Speed up security groups #2475

Merged

Optimize json encoding #2481

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up v3 security groups #2255

Speed up v3 security groups #2255

andy-paine commented May 4, 2021 •

edited

sethboyles commented May 5, 2021 •

edited

Speed up v3 security groups #2255

Speed up v3 security groups #2255

Conversation

andy-paine commented May 4, 2021 • edited

Summary

Best case scenario (for this change)

Worst case scenario (for this change)

Conclusion

sethboyles commented May 5, 2021 • edited

andy-paine commented May 4, 2021 •

edited

sethboyles commented May 5, 2021 •

edited