
Upload string translation status to Redshift #40738

Merged: 12 commits into staging, Aug 25, 2021

Conversation

daynew
Member

@daynew daynew commented May 24, 2021

We want a Redshift table we can query to tell whether a given string_key has been translated into a given language. This PR queries all the known string_keys, checks their translation status in every language, and then uploads that status to the analysis.i18n_string_translation_status Redshift table.

One thing to note: before we upload the translation status for a string_key, we first delete its existing data. This keeps the code simple by avoiding having to figure out whether our operation is an insert or an update; if we always delete the existing data first, adding the fresh data to Redshift is always an insert. An alternative is to follow the Redshift merge examples, but that complexity seems unnecessary for our use case.
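The delete-then-insert strategy can be sketched roughly as follows. This is an illustration only: the client object (anything responding to #exec with a SQL string) and the column names are assumptions, not the PR's actual helpers.

```ruby
# Minimal sketch of the delete-then-insert strategy. The client object
# and column names are illustrative assumptions.
TABLE = 'analysis.i18n_string_translation_status'.freeze

def upsert_translation_statuses(client, rows)
  # Delete any existing rows for the string_keys we are about to insert,
  # so the INSERT below can never produce duplicate rows.
  keys = rows.map {|r| "'#{r[:string_key]}'"}.join(', ')
  client.exec("DELETE FROM #{TABLE} WHERE string_key IN (#{keys})")

  # Build a multi-row VALUES clause and insert the fresh data.
  values = rows.map do |r|
    "('#{r[:string_key]}', '#{r[:locale]}', #{r[:translated]})"
  end.join(', ')
  client.exec("INSERT INTO #{TABLE} (string_key, locale, translated) VALUES #{values}")
end
```

Because the delete always runs first, the insert never has to distinguish new keys from existing ones.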

Links

Testing story

Ran locally and confirmed records were deleted and then re-inserted.

Output of running the script.


Redshift query showing how many rows were inserted


Sample Redshift data


Deployment strategy

Follow-up work

Privacy

Security

Caching

PR Checklist:

  • Tests provide adequate coverage
  • Privacy and Security impacts have been assessed
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

@daynew daynew requested a review from hacodeorg May 24, 2021 23:24
@daynew daynew requested a review from a team as a code owner May 24, 2021 23:24
@annaxuphoto
Contributor

The testing story is nice, but are there unit test cases we can add as well? Also, are there any cost concerns with deleting and re-inserting entries?

# Record the translation status for every string_key, e.g. "data.script.name.express-2020.title"
string_keys.each do |string_key|
# Guard against bad data: the database may contain an empty string_key
next if string_key.blank?
Contributor


If this ever happens, should we know about it? You could fire a Honeybadger error event and keep going, or let the script raise an exception, which will be reported to Honeybadger automatically.
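The report-and-continue variant of this suggestion could look like the sketch below. The reporter callable is injected here to keep the sketch self-contained and testable; in the codebase it would presumably be Honeybadger.notify, and the helper name and message are illustrative.

```ruby
# Sketch of the reviewer's suggestion: report blank keys instead of
# silently skipping them. The reporter is injected for testability;
# in production it would be the error tracker (e.g. Honeybadger.notify).
def filter_blank_keys(string_keys, reporter)
  string_keys.reject do |key|
    blank = key.to_s.strip.empty?
    reporter.call('Blank string_key found while updating translation status') if blank
    blank
  end
end
```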

Member Author


I don't think it's an issue worth distracting developers to resolve.

end
# Create the rows to be inserted in the analysis.i18n_string_translation_status Redshift table
# https://docs.aws.amazon.com/redshift/latest/dg/c_Examples_of_INSERT_30.html
values = translation_statuses.map do |status|
Contributor


How many rows do you expect translation_statuses to have (number of unique string keys in 90 days x number of locales)? If it is a significant number, are there any risks in keeping everything in memory (resource contention on production-daemon) or sending a long query string to Redshift?

Member Author


I don't know about the risks, but I do know that the query ran successfully. If this starts failing in the future, we can modify it to insert in batches. I'd like to keep it simple for now and make it more robust once we actually start hitting limitations.
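The batching fallback mentioned here could be sketched as below. The 1,000-row batch size is an arbitrary illustration (not a Redshift limit), and the client object, table, and column names are assumptions.

```ruby
# Sketch of the batching fallback: insert rows in fixed-size slices so
# no single query string grows unboundedly. Batch size, client, and
# column names are illustrative assumptions.
def insert_in_batches(client, table, rows, batch_size: 1000)
  rows.each_slice(batch_size) do |batch|
    values = batch.map do |r|
      "('#{r[:string_key]}', '#{r[:locale]}', #{r[:translated]})"
    end.join(', ')
    client.exec("INSERT INTO #{table} (string_key, locale, translated) VALUES #{values}")
  end
end
```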

# Delete the existing translation status rows before inserting them. We do this because Redshift
# doesn't have unique keys, so if we inserted data for a string_key which already exists, there
# would be two rows in the table for that string_key.
delete_translation_status(redshift_client, day_count)
insert_translation_status(redshift_client, day_count)
Contributor


What would happen if the deletion succeeds but the insertion fails? Can we recover?

Member Author


The recovery would be to run the script again. There isn't a backup, but I don't think that's a big deal because we just need the script to run again successfully; it will add back all the data that was deleted.
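An alternative to re-running the script (not part of this PR) would be to wrap the delete and insert in an explicit transaction so a failed insert rolls the delete back. Redshift supports BEGIN/COMMIT/ROLLBACK; the client object here is an assumption, and the two helpers are the ones this diff already calls.

```ruby
# Sketch: make delete + insert atomic with an explicit transaction so a
# failed insert rolls the deletion back. The client object is an
# illustrative assumption.
def replace_translation_status(redshift_client, day_count)
  redshift_client.exec('BEGIN')
  delete_translation_status(redshift_client, day_count)
  insert_translation_status(redshift_client, day_count)
  redshift_client.exec('COMMIT')
rescue StandardError
  redshift_client.exec('ROLLBACK')
  raise
end
```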

end

# This script is scheduled to run regularly on the i18n-dev server.
update_translation_status(90)
Contributor


Could you add an explanation for the value 90? It isn't clear to me why we have to go back 3 months if this script runs every day.

Member Author


Oh, I was doing this because the existing data is pretty old. I turned off the logging because we have some pending performance improvements assigned to me this sprint. I will make sure to check in a more reasonable value, like 7 days.

@daynew daynew force-pushed the fnd-1508-update-i18n-translation-status-redshift-table branch from 7cd1594 to 405efd5 on July 2, 2021 12:24
@daynew
Member Author

daynew commented Jul 2, 2021

The testing story is nice, but are there unit test cases we can add as well? Also, are there any cost concerns with deleting and re-inserting entries?

I have added some unit tests. Let me know if they match what you had in mind.

updater = I18n::TranslationStatus::Updater.new
updater.insert_translation_status(redshift, translation_service, locales, string_keys, current_time)
end
end
Contributor


Thanks for adding unit tests and figuring out mocks for Redshift! Could we add one more test for updating entries, checking that we delete the existing data and insert the new data correctly?

Member Author


I added a test!

Contributor

@annaxuphoto annaxuphoto left a comment


Thanks for accommodating all my requests!

@daynew daynew merged commit 1adc04e into staging Aug 25, 2021
@daynew daynew deleted the fnd-1508-update-i18n-translation-status-redshift-table branch August 25, 2021 17:19