
CloudFront real-time access log processor #41119

Merged
merged 5 commits into from Jun 23, 2021

Conversation


@wjordan commented Jun 15, 2021

This PR provides a CloudFormation stack that exports CloudFront real-time access logs to Parquet files in an S3 bucket, supporting efficient queries from Athena or Redshift Spectrum.

Links

CloudFront:

Kinesis Firehose:

Jira:

  • INF-381 ('Set up Real-time logs for CloudFront distributions')
  • INF-423 ('Log Pegasus HTTP logs to AWS Athena')

Testing story

Manually tested via the following steps:

  • Provision CloudFormation stack in dev account
  • Attach the exported configuration to an existing CloudFront distribution, and generate some HTTP requests against it
  • Query the access logs using Athena (via management console), e.g.: SELECT * FROM cdo_access_logs.access_logs WHERE datehour > '2021/06/15' ORDER BY timestamp DESC LIMIT 10
  • Create a test Redshift cluster and query against it:
    • Associate the exported Redshift Spectrum IAM role
    • Manually create the external schema using CREATE EXTERNAL SCHEMA query, e.g., CREATE EXTERNAL SCHEMA cdo_access_logs FROM DATA CATALOG DATABASE 'cdo_access_logs_dev' IAM_ROLE '[role_arn]';
    • Query the access logs using the external schema, e.g.: SELECT * FROM cdo_access_logs.access_logs WHERE datehour > '2021/06/15' LIMIT 10
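The Athena verification query above can also be driven from the AWS CLI rather than the management console. A minimal sketch, shown as a dry run that only prints the command (since actually submitting it requires dev-account credentials); the `s3://athena-query-results-example/` output location is a placeholder, not a resource created by this stack:

```shell
# Same verification query as the manual Athena test above.
QUERY="SELECT * FROM cdo_access_logs.access_logs \
WHERE datehour > '2021/06/15' ORDER BY timestamp DESC LIMIT 10"

# Dry run: print the invocation instead of executing it.
# Drop the leading 'echo' to actually submit the query.
echo aws athena start-query-execution \
  --query-string "$QUERY" \
  --query-execution-context Database=cdo_access_logs \
  --result-configuration OutputLocation=s3://athena-query-results-example/
```

The same pattern works for ad-hoc queries during the Redshift Spectrum test, substituting the external schema created by `CREATE EXTERNAL SCHEMA`.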

Deployment strategy

  • Manually create stack in production account
    • NOTE: Packaging and uploading the Lambda functions associated with the stack template requires the use of the `aws cloudformation package` command.
  • Update CloudFront distributions to associate the exported realtime log configuration
  • Monitor metrics to make sure everything is working properly
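The packaging step called out above can be sketched as follows; this is a dry run that only prints the commands, and the template file, artifact bucket, and stack name are illustrative placeholders rather than the actual names used by this PR:

```shell
# 'template.yml', 'cfn-artifacts-example', and the stack name are placeholders.

# package: uploads local Lambda source referenced by the template to S3 and
# writes a copy of the template with local paths rewritten to S3 locations.
PACKAGE_CMD="aws cloudformation package \
  --template-file template.yml \
  --s3-bucket cfn-artifacts-example \
  --output-template-file packaged.yml"

# deploy: creates or updates the stack from the packaged template.
# --capabilities CAPABILITY_IAM acknowledges that the stack creates IAM roles.
DEPLOY_CMD="aws cloudformation deploy \
  --template-file packaged.yml \
  --stack-name cloudfront-access-logs \
  --capabilities CAPABILITY_IAM"

# Dry run; pipe to 'sh' (or run the commands directly) to execute.
echo "$PACKAGE_CMD"
echo "$DEPLOY_CMD"
```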

Follow-up work

  • (eventually) disable classic access logs once they're no longer needed

Privacy

This feature controls the creation of HTTP access logs which contain IP addresses, user-agents and other potentially-sensitive data (depending on the data encoded in paths, query parameters, etc).

Security

  • Access logs are stored in a private S3 bucket with all 'block public access' configuration set, and with at-rest encryption (AWS-managed key)
  • Kinesis data stream is stored with at-rest encryption (AWS-managed key)
  • IAM roles for the various services invoked by this system (CloudFront, Firehose, Redshift, Lambda) grant least privilege to created resources.

PR Checklist:

  • Tests provide adequate coverage
  • Privacy and Security impacts have been assessed
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

@wjordan requested a review from sureshc June 15, 2021 17:22
@wjordan commented Jun 23, 2021

Note that merging this PR will update all behaviors in all CloudFront distributions to stream access logs to a single bucket. This is a deliberate tradeoff compared to streaming to separate per-environment buckets/prefixes and configuring a separate access-log stream pipeline for each:

Pros:

  • Simplicity (fewer total AWS resources to manage, single service-oriented CloudFormation stack to process all access logs)

Cons:

  • Access logs from different environments will be intermingled in the same log files stored in S3
    • This includes any CDN-enabled adhoc deployments
    • If the assumption is that 'most' traffic will be production traffic, this shouldn't make a big difference to the costs, lifecycle, or security requirements of the access logs (e.g., treating all of the logs as containing potentially-sensitive information), so it shouldn't be a dealbreaker.
  • Simple Athena/Redshift queries against the external table will return logs from all environments
    • It will still be possible to filter by individual host header if desired (e.g., with a WHERE "cs-host" = 'studio.code.org' clause)
    • It will not be possible to filter by environment (e.g., WHERE environment = 'production')

I propose we try out this simple single-stream approach for now, with a plan to revisit the alternative stream-per-environment approach if/when we encounter any issues in practice.

@wjordan merged commit d05eed3 into staging Jun 23, 2021
@wjordan deleted the access_logs branch June 23, 2021 17:38