Upload string translation status to Redshift #40738
Conversation
…nd-1508-update-i18n-translation-status-redshift-table
The testing story is nice, but are there unit test cases we can add as well? Also, are there any cost concerns with deleting and re-inserting entries?
# Record the translation status for every string_key, i.e. "data.script.name.express-2020.title"
string_keys.each do |string_key|
  # Maybe there is bad data in the database and we recorded an empty string
  next if string_key.blank?
If this ever happens, should we know about it? You can fire a Honeybadger error event and then keep going. Or you can let the script raise an exception; it will be reported to Honeybadger automatically.
I don't think it's an issue worth distracting developers to resolve.
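For reference, the reviewer's first suggestion (report and keep going) could be sketched like this. The `filter_string_keys` helper and the reporting block are hypothetical illustrations, not code from this PR; in production the block would wrap a `Honeybadger.notify` call:

```ruby
# Skip blank string_keys but optionally surface them instead of failing.
# The block passed in is a hypothetical stand-in for Honeybadger.notify.
def filter_string_keys(string_keys, &report_error)
  string_keys.reject do |string_key|
    blank = string_key.to_s.strip.empty?
    report_error&.call("Blank string_key in translation status data") if blank
    blank
  end
end

reported = []
keys = filter_string_keys(["data.script.name.express-2020.title", "", nil]) { |msg| reported << msg }
# keys keeps only the non-blank entries; reported gets one message per blank key
```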
end
# Create the rows to be inserted in the analysis.i18n_string_translation_status Redshift table
# https://docs.aws.amazon.com/redshift/latest/dg/c_Examples_of_INSERT_30.html
values = translation_statuses.map do |status|
How many rows do you expect translation_statuses would have (number of unique string keys in 90 days x number of locales)? If it is a significant number, are there any risks in holding everything in memory (resource contention on production-daemon) and sending a very long query string to Redshift?
I don't know about the risk, but I do know that the query successfully ran. If we run into issues with this failing in the future, we can modify this to do batches instead. I'd like to keep it simple for now, and then make it more robust once we actually start hitting limitations.
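If the script does eventually hit those limits, the batching fallback the author mentions could look roughly like this. The `build_insert_batches` helper and the column names are illustrative assumptions, not the PR's actual code, and a real implementation should escape or parameterize values rather than interpolate them:

```ruby
# Sketch of batching: split the statuses into fixed-size chunks and build
# one multi-row INSERT per chunk, bounding both memory use and query length.
# Column names (string_key, locale, translated) are assumed for illustration.
TABLE_NAME = "analysis.i18n_string_translation_status".freeze

def build_insert_batches(translation_statuses, batch_size: 500)
  translation_statuses.each_slice(batch_size).map do |batch|
    values = batch.map do |s|
      "('#{s[:string_key]}', '#{s[:locale]}', #{s[:translated]})"
    end
    "INSERT INTO #{TABLE_NAME} (string_key, locale, translated) VALUES #{values.join(', ')};"
  end
end
```

Each returned statement could then be sent to Redshift separately, so no single query grows with the total row count.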
# inserting them. We do this because Redshift doesn't have unique keys, so if we inserted data for a string_key which
# already exists, then there would be two rows in the database for the string_key.
delete_translation_status(redshift_client, day_count)
insert_translation_status(redshift_client, day_count)
What would happen if the deletion succeeds but the insertion fails? Can we recover?
The recovery would be to run the script again. There isn't a backup, but I don't think that's a big deal because we just need the script to run again successfully. It will add back all the data which was deleted.
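If re-running ever proves painful, another option is to lean on Redshift's transaction support so a failed insert rolls the delete back. This is only a sketch with a fake client; `exec` is a hypothetical stand-in for whatever query method the real Redshift client exposes:

```ruby
# Wrap delete + insert in one transaction so the delete is rolled back if
# the insert fails, closing the window where the table has no data.
def replace_rows(client, delete_sql, insert_sql)
  client.exec("BEGIN;")
  client.exec(delete_sql)
  client.exec(insert_sql)
  client.exec("COMMIT;")
rescue
  client.exec("ROLLBACK;")
  raise
end

# Minimal fake client for demonstration: records every statement it runs
# and raises on a marker string to simulate an insert failure.
class FakeClient
  attr_reader :statements

  def initialize
    @statements = []
  end

  def exec(sql)
    raise "query failed" if sql.include?("FAIL")
    statements << sql
  end
end
```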
end
# This script is scheduled to run regularly on the i18n-dev server. | ||
update_translation_status(90) |
Could you add explanation for the value 90? It isn't clear to me why we have to go back 3 months if this script runs everyday.
Oh, I was doing this because the existing data is pretty old. I turned off the logging because we have some pending performance improvements assigned to me this sprint. I will make sure to check in a more reasonable value, like 7 days.
Force-pushed from 7cd1594 to 405efd5
I have added some unit tests. Let me know if they match what you had in mind.
    updater = I18n::TranslationStatus::Updater.new
    updater.insert_translation_status(redshift, translation_service, locales, string_keys, current_time)
  end
end
Thanks for adding unit tests and figuring out mocks for Redshift! Could we add one more test for updating entries: check that we appropriately delete existing data and insert the new data correctly?
I added a test!
Thanks for accommodating all my requests!
We want to have a Redshift table which we can query to tell whether a given string_key has been translated into a given language. This PR queries all the known string_keys, checks their translation status in every language, and then uploads that status to the analysis.i18n_string_translation_status Redshift table.

One thing to note is that before we upload the translation status for a string_key, we first delete the existing data for it. This keeps the code simple: we never have to figure out whether our operation is an insert or an update, because if we always delete the existing data first, adding the fresh data to Redshift is always an insert. An alternative would be to follow the Redshift Merge examples, but that complexity seems unnecessary for our use case.
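A rough sketch of the delete step described above. The table name comes from this PR, but the column name and the naive quoting are illustrative assumptions; real code should escape or parameterize the keys:

```ruby
# Build the DELETE that clears existing rows for the string_keys about to be
# re-inserted, so the subsequent write is always a pure INSERT.
def build_delete_sql(string_keys)
  quoted = string_keys.map { |k| "'#{k}'" }.join(", ")
  "DELETE FROM analysis.i18n_string_translation_status WHERE string_key IN (#{quoted});"
end
```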
Links
Testing story
Ran locally and confirmed records were deleted and then re-inserted.
Output of running the script.
Redshift query showing how many rows were inserted
Sample Redshift data
Deployment strategy
Follow-up work
Privacy
Security
Caching
PR Checklist: