
Write Session Data to Redis #57562

Merged

merged 16 commits into staging on Apr 9, 2024

Conversation

@Hamms (Contributor) commented Mar 26, 2024

Notably, however, we do not yet begin relying on Redis for *reading* session data.

We want to get an accurate evaluation of what real-world usage of this new Redis cluster will look like, and also benefit from some load testing in advance of an actual switchover. To that end, this PR proposes that we begin writing all session data to Redis as a side effect of using the existing cookie-based store. This new functionality will be gated behind a DCDO flag and wrapped in error handling logic to minimize the chances of this experiment affecting the user experience.
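The mirroring approach described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual implementation: the class name, the flag callable, and the logger wiring are all assumptions; the real change hooks into the existing cookie-based store.

```ruby
require 'json'

# Hypothetical sketch: mirror each session write into Redis as a side
# effect, gated behind a feature flag and wrapped in error handling so a
# Redis failure can never surface to the user.
class MirroringCookieStore
  def initialize(redis:, flag_enabled:, logger: nil)
    @redis = redis
    @flag_enabled = flag_enabled # callable, e.g. a DCDO flag lookup
    @logger = logger
  end

  # Called alongside the normal cookie write; never raises.
  def mirror_write(session_id, session_hash, ttl_seconds)
    return unless @flag_enabled.call
    @redis.setex(session_id, ttl_seconds, JSON.generate(session_hash))
  rescue StandardError => e
    @logger&.warn("session mirror failed: #{e.message}")
  end
end
```

The key design point is that the cookie store remains the source of truth: the Redis write is fire-and-forget, and any exception is logged and swallowed.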

Also remove the instrumented cookie store; we don't need to manually gather data for estimates now that we'll have an actual Redis cluster to monitor.

Links

Follow-up to #57558

Testing story

Tested the Redis cluster provisioned in production by SSHing into a frontend instance and executing some manual queries from the dashboard console:

```
[production] dashboard > client = Redis.new(url: "rediss://[redacted].cache.amazonaws.com:6379")
=> #<Redis client v3.3.3 for redis://[redacted].cache.amazonaws.com:6379/0>
[production] dashboard > client.get('foo')
=> nil
[production] dashboard > client.keys
=> []
```

Tested the functionality added in this PR by temporarily modifying cloud_formation_stack.yml.erb to provision a redis session storage cluster even in an adhoc environment, then verified on an adhoc that registration, authentication, and session destruction all work as expected, and that I can successfully find session data in the redis store.

Deployment strategy

We should plan to monitor Honeybadger and Zendesk after this DTPs (deploys to production), and have a revert prepped just in case.

After deployment, we should plan to set the DCDO flag mirror_session_in_redis_enabled to true so we can begin collecting data.

Follow-up work

Assuming that our new Redis cluster is able to handle the load generated by this mirroring, we can follow this up with an implementation that will begin using the Redis store rather than cookies as the source of truth.

PR Checklist:

  • Tests provide adequate coverage
  • Privacy and Security impacts have been assessed
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

The first step in our efforts to transfer user session data from cookies
to a centralized data store is to provision the data store in
CloudFormation, so we can begin writing data to it.

We have quite a few options for how exactly we want to provision this,
but broadly speaking we can choose either Serverless or On-Demand nodes,
and if we choose On-Demand we also must choose both the size and
quantity of nodes we want.

According to Google Analytics, we've had 28M users over the last 90
days, and 84M over the last year, which works out to between 45 and 60
million users on average over an arbitrary 200-day period (200 days
being our current session duration configuration). At ~300 bytes per
session on average, that's 13.5-18 GB of data we'd ideally like to
store.

Storing 18GB of data in Serverless ElastiCache would cost ~$1620/month;
even 13GB would still be ~$1170/month. Alternatively, we could use
three of the smallest memory-optimized cache nodes (cache.r7g.large) and
get 39.21GB of storage for ~$480/month, or even just use two for 26.14GB
of storage at ~$320/month. An even more targeted solution could be to
use a larger number of smaller nodes; 6 cache.t4g.medium nodes would
allow us to store 18.5GB for ~$280/month.
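As a quick sanity check on the storage arithmetic above (figures taken directly from the estimates in this commit message):

```ruby
# Back-of-the-envelope check: average users over a 200-day window
# times ~300 bytes per session.
BYTES_PER_SESSION = 300
GB = 10**9

low_gb  = 45_000_000 * BYTES_PER_SESSION / GB.to_f
high_gb = 60_000_000 * BYTES_PER_SESSION / GB.to_f
puts format('%.1f-%.1f GB', low_gb, high_gb) # 13.5-18.0 GB
```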

Throughput is less of an issue for us than storage. According to
CloudWatch, we average ~800M requests per week, peaking around 15M/hour
and under 4,200/second. At an average of 300 bytes stored per session
and one GET plus one PUT per request, that's ~2.5 MB/s (about 20
megabits per second) of needed network performance (well under the 12.5
Gbit/s promised by cache.r7g.large nodes), or ~$25/month in ECPUs for a
Serverless model.
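The bandwidth figure can be checked the same way; note the result lands in megabits per second, comfortably under the 12.5 Gbit/s network limit of cache.r7g.large nodes:

```ruby
# Bandwidth check: requests/sec * bytes/session * (one GET + one PUT).
reqs_per_sec      = 4_200
bytes_per_session = 300
ops_per_request   = 2

bytes_per_sec = reqs_per_sec * bytes_per_session * ops_per_request
mbit_per_sec  = bytes_per_sec * 8 / 1_000_000.0
puts format('%.2f Mbit/s', mbit_per_sec) # 20.16 Mbit/s
```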

This PR proposes we provision a single cluster consisting of three
cache.r7g.large nodes, with encryption, automatic failover, and
multi-zone availability all enabled. That should give us plenty of room
for growth without too much unnecessary expense. The next step in our
broader efforts to enable this feature is to begin writing data to Redis
so we can get a more accurate idea of actual usage, at which point we
can revisit this decision.
Base automatically changed from provision-redis-based-session-store to staging April 2, 2024 18:58
```diff
@@ -113,7 +113,7 @@ cat <<JSON > $FIRST_BOOT
   },
   "cdo-secrets": {
 <% if frontends -%>
-    "session_store_server": "redis://${SessionStoreGroup.PrimaryEndPoint.Address}:${SessionStoreGroup.PrimaryEndPoint.Port}",
+    "session_store_server": "rediss://${SessionStoreGroup.PrimaryEndPoint.Address}:${SessionStoreGroup.PrimaryEndPoint.Port}",
```
@Hamms (Contributor, Author) commented Apr 3, 2024

Note that because we have enabled encryption for the underlying Redis cluster, we have to specifically use a "redis secure" connection here, hence rediss rather than redis.

I very briefly tried to find a way to automatically determine whether or not this is necessary based on the CloudFormation output, but couldn't. I decided to just quickly hardcode it instead; please let me know if you think it would be worth me spending a bit more time on this!

@Hamms Hamms marked this pull request as ready for review April 3, 2024 00:34
@Hamms Hamms requested a review from a team as a code owner April 3, 2024 00:34
@sureshc (Contributor) commented Apr 4, 2024

We don't yet have a standard way to take the Physical ID of a Resource provisioned in CloudFormation and pass that configuration value to our Ruby web application servers.

Specifically, we don't have an easy way to populate CDO.session_store_server from the SessionStore CloudFormation Resource. Our usage of SessionStoreGroup.PrimaryEndPoint.Address and SessionStoreGroup.PrimaryEndPoint.Port in this feature branch only passes those values to the Chef cookbooks that execute when the daemon instance is provisioned. Those values don't get persisted to that server nor are they ever passed to web application servers or the console.

The very janky workaround I've come up with is to publish an AWS Stack-specific Secret in the template and then use !StackSecret to populate CDO.session_store_server

@cat5inthecradle (Contributor) left a comment:

LGTM

Would there be value in reading the session from Redis in order to evaluate that half of usage too? Or is it pretty much 1:1 with writes?

@Hamms (Contributor, Author) commented Apr 4, 2024

> Specifically, we don't have an easy way to populate CDO.session_store_server from CloudFormation Logical IDs and Attributes SessionStoreGroup.PrimaryEndPoint.Address and SessionStoreGroup.PrimaryEndPoint.Port

Suresh, can you clarify how the changes to aws/cloudformation/bootstrap_chef_stack.sh.erb are insufficient? Note that the old version of this value has already been populated on frontends:

```
[production] dashboard > CDO.session_store_server
=> "redis://[redacted].cache.amazonaws.com:6379"
```

Is it just a problem with updating the value once it's been initially set? My expectation is that whenever we spin up a new frontend, it will pick up these changes.

> Would there be value in reading the session from Redis in order to evaluate that half of usage too? Or is it pretty much 1:1 with writes?

In my local testing, it seems like session reads and writes are in fact exactly 1:1; every single request incurs exactly one write and one read. I'm certainly not opposed to adding some logic to also read just for thoroughness, but I think just doubling the number of operations generated as a result of this change should accurately capture the number of operations we'll need for full usage.

@cat5inthecradle (Contributor) commented:
> We don't yet have a standard way to take the Physical ID of a Resource provisioned in CloudFormation and pass that configuration value to our Ruby web application servers.
>
> Specifically, we don't have an easy way to populate CDO.session_store_server from the SessionStore CloudFormation Resource. Our usage of SessionStoreGroup.PrimaryEndPoint.Address and SessionStoreGroup.PrimaryEndPoint.Port in this feature branch only passes those values to the Chef cookbooks that execute when the daemon instance is provisioned. Those values don't get persisted to that server nor are they ever passed to web application servers or the console.
>
> The very janky workaround I've come up with is to publish an AWS Stack-specific Secret in the template and then use !StackSecret to populate CDO.session_store_server

session_store_server should get set correctly on frontends with this change - the bootstrap script runs whenever we wake up AMI Builder.

@Hamms (Contributor, Author) commented Apr 5, 2024

I'll also plan to resolve https://app.honeybadger.io/projects/3240/faults/106697134 once this PR is merged, since it removes the relevant code.

@unlox775-code-dot-org (Contributor) left a comment:

Looks good!

@Hamms Hamms merged commit 9e3fdab into staging Apr 9, 2024
2 checks passed
@Hamms Hamms deleted the write-to-redis-based-session-store branch April 9, 2024 19:22
Hamms added a commit that referenced this pull request Apr 11, 2024
Follow-up to #57562,
which started writing session data to Redis with the goal of giving us a
chance to analyze real-world usage in advance of a switch to actually
start using Redis as the source of truth.

For the most part, that's working, but there are currently a couple of
issues with that implementation which both affect the total amount of
data being stored, and render that information inaccurate.

First, we aren't deleting old sessions from Redis, so aren't capturing
the extent to which users logging out will reduce total data storage. To
resolve that, this PR simply extends the
[`delete_session`](https://github.com/rails/rails/blob/v6.1.7.7/actionpack/lib/action_dispatch/middleware/session/cookie_store.rb#L63-L68)
method to begin mirroring deletions in Redis the same as we mirror
writes.
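A minimal sketch of the deletion-mirroring idea (class and method names here are illustrative; the actual change overrides Rails' `CookieStore#delete_session`):

```ruby
# Illustrative mirror that forwards both writes and deletions to Redis,
# swallowing Redis errors so session handling never breaks for users.
class SessionMirror
  def initialize(redis)
    @redis = redis
  end

  def write(session_id, payload, ttl_seconds)
    @redis.setex(session_id, ttl_seconds, payload)
  rescue StandardError
    nil
  end

  # Without this, logged-out sessions linger in Redis and inflate the
  # storage metrics we're trying to measure.
  def delete(session_id)
    @redis.del(session_id)
  rescue StandardError
    nil
  end
end
```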

Second, we're actually writing more data than we need to; I mistakenly
thought we were going to need to manually serialize and deserialize the
Ruby Hashes that our session data is stored in, and chose to do so with
`Marshal`, which is much less space-efficient than simply passing the
Hash directly to Redis to store. In my local testing, a Hash for a
freshly-logged-in session is 168 bytes in memory, but the `Marshal`
serialization takes 424 bytes.

Simplifying the write logic and enabling deletion logic should improve
the accuracy of our metrics.
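The serialization overhead is easy to see with a made-up session Hash (the hash contents below are illustrative, not the real session layout):

```ruby
require 'json'

# Marshal embeds Ruby type and encoding metadata per object, so its
# output is noticeably larger than a plain JSON encoding of the same Hash.
session = {
  'session_id' => 'abc123',
  'user_id'    => 42,
  'flash'      => { 'notice' => 'Signed in successfully.' }
}

marshal_bytes = Marshal.dump(session).bytesize
json_bytes    = JSON.generate(session).bytesize
puts "Marshal: #{marshal_bytes} bytes, JSON: #{json_bytes} bytes"
```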
Hamms added a commit that referenced this pull request Apr 12, 2024