Archive emails to S3 #627
Merged
Changes from all commits (11 commits):

- e758af5 Install AWS S3 Gem (kevindew)
- 14b68ad Service to export email archive data to S3 (kevindew)
- 2de0f9a Add exported_at field for Email Archives table (kevindew)
- a321449 Add finished_sending_at and exported_at indexes to EmailArchives (kevindew)
- cff0055 Task to export email archive data to S3 (kevindew)
- 26e2066 EmailArchivePresenter class (kevindew)
- 3763814 Make EmailArchivePresenter present for_s3 and for_db (kevindew)
- 587df6e Change exporting format to change dates to UTC (kevindew)
- d6c206a Update Email Archive Service for UTC string timestamps (kevindew)
- 2fcc97c Update S3 Export data to have higher precision dates (kevindew)
- d6fb1d8 Integrate with EmailArchiveWorker (kevindew)
EmailArchivePresenter (new file, 60 additions):

```ruby
class EmailArchivePresenter
  S3_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S.%L".freeze

  # This is expected to be called with a JSON representation of a record
  # returned from EmailArchiveQuery
  def self.for_s3(*args)
    new.for_s3(*args)
  end

  def for_s3(record, archived_at)
    {
      archived_at_utc: archived_at.utc.strftime(S3_DATETIME_FORMAT),
      content_change: build_content_change(record),
      created_at_utc: record.fetch("created_at").utc.strftime(S3_DATETIME_FORMAT),
      finished_sending_at_utc: record.fetch("finished_sending_at").utc.strftime(S3_DATETIME_FORMAT),
      id: record.fetch("id"),
      sent: record.fetch("sent"),
      subject: record.fetch("subject"),
      subscriber_id: record.fetch("subscriber_id"),
    }
  end

  def self.for_db(*args)
    new.for_db(*args)
  end

  def for_db(record, archived_at, exported_at = nil)
    {
      archived_at: archived_at,
      content_change: build_content_change(record),
      created_at: record.fetch("created_at"),
      exported_at: exported_at,
      finished_sending_at: record.fetch("finished_sending_at"),
      id: record.fetch("id"),
      sent: record.fetch("sent"),
      subject: record.fetch("subject"),
      subscriber_id: record.fetch("subscriber_id"),
    }
  end

  private_class_method :new

  private

  def build_content_change(record)
    return if record.fetch("content_change_ids").empty?

    if record.fetch("digest_run_ids").count > 1
      error = "Email with id: #{record['id']} is associated with "\
        "multiple digest runs: #{record['digest_run_ids'].join(', ')}"
      GovukError.notify(error)
    end

    {
      content_change_ids: record.fetch("content_change_ids"),
      digest_run_id: record.fetch("digest_run_ids").first,
      subscription_ids: record.fetch("subscription_ids"),
    }
  end
end
```
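As a quick sanity check of the S3 timestamp format, the strftime pattern above can be exercised in plain Ruby; the constant is copied here so the sketch is standalone. `%L` emits three-digit milliseconds, which matches the "higher precision dates" commit in this PR:

```ruby
require "time"

# Copy of EmailArchivePresenter::S3_DATETIME_FORMAT for a standalone example
S3_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S.%L".freeze

# A timestamp with a 500ms fractional component
created_at = Time.utc(2018, 6, 28, 13, 49, 12) + 0.5
formatted = created_at.utc.strftime(S3_DATETIME_FORMAT)
puts formatted # => "2018-06-28 13:49:12.500"
```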
S3EmailArchiveService (new file, 53 additions):

```ruby
class S3EmailArchiveService
  def self.call(*args)
    new.call(*args)
  end

  # For batch we expect an array of hashes containing email data in the format
  # from EmailArchivePresenter
  def call(batch)
    group_by_date(batch).map { |prefix, records| send_to_s3(prefix, records) }
  end

  private_class_method :new

  private

  def group_by_date(batch)
    batch.group_by do |item|
      # we group by date in this way to create partitions for s3/athena
      # these are grouped in case dates span more than one day
      Date.parse(
        item.fetch(:finished_sending_at_utc)
      ).strftime("year=%Y/month=%m/date=%d")
    end
  end

  def send_to_s3(prefix, records)
    records = records.sort_by { |r| r.fetch(:finished_sending_at_utc) }
    last_time = records.last[:finished_sending_at_utc]
    obj = bucket.object(object_name(prefix, last_time))
    obj.put(
      body: object_body(records),
      content_encoding: "gzip"
    )
  end

  def bucket
    @bucket ||= begin
      s3 = Aws::S3::Resource.new
      s3.bucket(ENV.fetch("EMAIL_ARCHIVE_S3_BUCKET"))
    end
  end

  def object_name(prefix, last_time)
    uuid = SecureRandom.uuid
    time = ActiveSupport::TimeZone["UTC"].parse(last_time)
    "email-archive/#{prefix}/#{time.to_s(:iso8601)}-#{uuid}.json.gz"
  end

  def object_body(records)
    data = records.map(&:to_json).join("\n") + "\n"
    ActiveSupport::Gzip.compress(data, Zlib::BEST_COMPRESSION)
  end
end
```
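The partitioning and body-building behaviour can be sketched without AWS. The `group_by` prefix produces Athena-style `year=/month=/date=` partitions, and stdlib `Zlib` (standing in for `ActiveSupport::Gzip`, which wraps it) shows the newline-delimited JSON body round-trips. The sample batch below is illustrative, not real data:

```ruby
require "date"
require "json"
require "zlib"

# Hypothetical batch straddling midnight UTC, in the presenter's format
batch = [
  { id: 1, finished_sending_at_utc: "2018-06-28 23:59:59.900" },
  { id: 2, finished_sending_at_utc: "2018-06-29 00:00:00.100" },
]

# Same grouping logic as group_by_date: one S3 prefix per calendar day
grouped = batch.group_by do |item|
  Date.parse(item.fetch(:finished_sending_at_utc))
      .strftime("year=%Y/month=%m/date=%d")
end
p grouped.keys
# => ["year=2018/month=06/date=28", "year=2018/month=06/date=29"]

# Same body shape as object_body: newline-delimited JSON, gzipped
records = grouped.fetch("year=2018/month=06/date=28")
data = records.map(&:to_json).join("\n") + "\n"
gz = Zlib.gzip(data, level: Zlib::BEST_COMPRESSION)
Zlib.gunzip(gz) == data # round-trips losslessly
```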
db/migrate/20180628134912_add_exported_at_to_email_archives.rb (5 additions):

```ruby
class AddExportedAtToEmailArchives < ActiveRecord::Migration[5.2]
  def change
    add_column :email_archives, :exported_at, :datetime
  end
end
```
AddIndexesToEmailArchive migration (new file, 7 additions):

```ruby
class AddIndexesToEmailArchive < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def change
    add_index :email_archives, :finished_sending_at, algorithm: :concurrently
    add_index :email_archives, :exported_at, algorithm: :concurrently
  end
end
```
EmailArchiveExporter (new file, 91 additions):

```ruby
class EmailArchiveExporter
  def self.call(*args)
    new.call(*args)
  end

  def call(from_date, until_date)
    from_date = Date.parse(from_date)
    until_date = Date.parse(until_date)

    puts "Exporting records that finished sending from #{from_date} and before #{until_date}"

    total = (from_date...until_date).inject(0) { |sum, date| sum + export(date) }

    puts "Exported #{total} records"
  end

  private_class_method :new

  private

  def export(date)
    puts "Exporting #{date}"
    start = Time.now

    count = 0

    loop do
      records = email_archive_records(date)

      break unless records.any?

      ExportToS3.call(records)

      count += records.count
      puts "Processed #{count} emails"
    end

    seconds = Time.now.to_i - start.to_i
    puts "Completed #{date} in #{seconds} seconds"

    count
  end

  def email_archive_records(date)
    EmailArchive
      .where(
        "finished_sending_at >= ? AND finished_sending_at < ?",
        date,
        date + 1.day
      )
      .where(exported_at: nil)
      .order(finished_sending_at: :asc, id: :asc)
      .limit(50_000)
      .as_json
  end

  class ExportToS3
    def self.call(*args)
      new.call(*args)
    end

    def call(records)
      send_to_s3(records)
      mark_as_exported(records)
    end

    private

    def send_to_s3(records)
      batch = records.map do |r|
        {
          archived_at_utc: r["archived_at"].utc.strftime(EmailArchivePresenter::S3_DATETIME_FORMAT),
          content_change: r["content_change"],
          created_at_utc: r["created_at"].utc.strftime(EmailArchivePresenter::S3_DATETIME_FORMAT),
          finished_sending_at_utc: r["finished_sending_at"].utc.strftime(EmailArchivePresenter::S3_DATETIME_FORMAT),
          id: r["id"],
          sent: r["sent"],
          subject: r["subject"],
          subscriber_id: r["subscriber_id"]
        }
      end

      S3EmailArchiveService.call(batch)
    end

    def mark_as_exported(records)
      ids = records.map { |r| r["id"] }
      EmailArchive.where(id: ids).update_all(exported_at: Time.now)
    end
  end
end
```
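One subtlety in `call` is the three-dot Range: `(from_date...until_date)` excludes `until_date` itself, which matches the "from ... and before ..." wording in the progress message. A standalone illustration:

```ruby
require "date"

from_date = Date.parse("2018-07-01")
until_date = Date.parse("2018-07-04")

# Three-dot range: end date is excluded, so 2018-07-04 is NOT exported
dates = (from_date...until_date).to_a
p dates.map(&:iso8601)
# => ["2018-07-01", "2018-07-02", "2018-07-03"]
```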
Review comment:

If this fails half way through, can we end up in a state where half the archives are uploaded, and then the transaction rolls back, and when it runs again we get everything uploaded, so we end up with double the archives on half of the dates?

I haven't thought about it too much, but can we build this `uuid` from some sort of hash of the batch to avoid this situation? I realise that in most cases only one day's worth of emails will be archived, so this won't be a problem though.
Reply:

Yeah, this is a problem we have with every upload to S3: if we think it's failed but it actually succeeded, we may end up duplicating the data. The only way I can think around this is having very predictable batching so that we can generate a reliable hash, but this seems difficult to do relative to the risk involved.
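As an illustration of the reviewer's suggestion (not part of this PR), the object-name suffix could be derived deterministically from the batch's ids instead of `SecureRandom.uuid`, so a retried upload of the same batch reuses the same S3 key and overwrites rather than duplicates. The helper name below is hypothetical:

```ruby
require "digest"

# Hypothetical alternative to SecureRandom.uuid in object_name: hash the
# sorted record ids so the same batch always yields the same key suffix.
def batch_suffix(records)
  ids = records.map { |r| r.fetch(:id) }.sort
  Digest::SHA256.hexdigest(ids.join(","))[0, 32]
end

# Order of records does not change the suffix
batch_suffix([{ id: "b" }, { id: "a" }]) == batch_suffix([{ id: "a" }, { id: "b" }])
```

This only helps if batching is reproducible across retries, which is the difficulty the reply points out.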