Extract and translate placeholder texts #36983

Merged on Oct 7, 2020 (4 commits)

Contributor

@hacodeorg hacodeorg commented Sep 30, 2020

Task FND-1209:
This is the first step to enable translation of placeholder texts. We extract placeholder texts from .level files during the i18n sync-in step.

Since placeholder texts can be empty strings, binary numbers or just several question marks, we require the strings to have at least 3 consecutive alphabetic characters.
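A minimal Ruby sketch of this filter follows. The regular expression is the one that appears in the diff under review; the helper name `translatable_placeholder?` is hypothetical and only here for illustration.

```ruby
# Hypothetical helper (not a name from the codebase): true when the text
# contains at least three consecutive ASCII letters.
def translatable_placeholder?(text)
  text.to_s.match?(/[a-zA-Z]{3,}/)
end

# Empty strings, binary numbers and runs of question marks are rejected.
['', '010101', '???', 'Yummy!'].each do |sample|
  puts "#{sample.inspect}: #{translatable_placeholder?(sample)}"
end
```

Note that the check only looks at ASCII letters, so a source string written entirely in another script would be skipped under this rule.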

Example

A puzzle with placeholder texts https://studio.code.org/s/coursee-2020/stage/9/puzzle/1:
[Image: Screen Shot 2020-09-30 at 11 05 41 AM]

Those texts are defined in a .level file:

<block type="text">
<title name="TEXT">That's me! Rikki! I like to code, hangout with Thuy, and eat ice cream!</title>
</block>

After the sync-in step, the placeholder texts are extracted to i18n/locales/source/course_content/2020/coursee-2020.json:

{
  "https://studio.code.org/s/coursee-2020/stage/9/puzzle/1": {
    "placeholder_texts": {
      "b63151482630edd1589a9ac24d107c49": "That's me! Rikki! I like to code, hangout with Thuy, and eat ice cream!",
      "810896b14fc6615f0c76628b9b6a727e": "That's my best friend Thuy! She's really good at sports!",
      "5d48ecdd8e3baeb99c6a8dcb7faa13dd": "Ice cream is my favorite treat! But I probably shouldn't eat it on the couch...",
      "a50ec512bdd98483a7cdcd88cbe11933": "That's my pet rabbit, Ms. Lolipop! I have no idea why I named her that!",
      "6f97975a0bda0c9219915161149cbb26": "That's my computer! I code on it ALL the time!",
      "1c6636ce086fef9305c6e3d25e37d5b2": "Here's a secret: Thuy is extremely ticklish!",
      "fce3b93eaa64a8869c47ba25daa8887f": "Yummy!",
      "41a18649f969a00ec0b2feba20db997f": "I think I like this color better on you, Ms. Lolipop!",
      "756afb811df2eb3ec88a9e98f3dcaa8d": "This computer can't handle my mad coding skills!"
    }
  }
}
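The extraction step could be sketched roughly as below. This is not the code from the PR: it uses REXML to stay self-contained (the diff appears to use a different XML API), the method name `extract_placeholder_texts` is hypothetical, and it assumes each key is the MD5 hex digest of the string content, as settled on later in the review.

```ruby
require 'digest'
require 'json'
require 'rexml/document' # bundled gem on Ruby 3.0+, standard library before that

# Hypothetical helper: collects the TEXT titles of text blocks, keyed by the
# MD5 hex digest of their content (an assumption about the key format).
def extract_placeholder_texts(level_xml)
  doc = REXML::Document.new(level_xml)
  texts = {}
  REXML::XPath.each(doc, "//block[@type='text']/title[@name='TEXT']") do |title|
    content = title.text.to_s
    # Skip strings without at least three consecutive letters.
    next unless content.match?(/[a-zA-Z]{3,}/)
    texts[Digest::MD5.hexdigest(content)] = content
  end
  texts
end

sample_xml = <<~XML
  <xml>
    <block type="text">
      <title name="TEXT">Yummy!</title>
    </block>
    <block type="text">
      <title name="TEXT">???</title>
    </block>
  </xml>
XML

level_url = 'https://studio.code.org/s/coursee-2020/stage/9/puzzle/1'
output = { level_url => { 'placeholder_texts' => extract_placeholder_texts(sample_xml) } }
puts JSON.pretty_generate(output)
```

Only the "Yummy!" block survives the filter in this sample; the "???" block is dropped.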

After the sync-down step, translations for the placeholder texts are downloaded to i18n/locales/<locale>/course_content/2020/coursee-2020.json:

{
  "https://studio.code.org/s/coursee-2020/stage/9/puzzle/1": {
    "placeholder_texts": {
      "b63151482630edd1589a9ac24d107c49": "toi la rikki",
      "810896b14fc6615f0c76628b9b6a727e": "day la ban Thuy!",
      "5d48ecdd8e3baeb99c6a8dcb7faa13dd": "kem rat ngon....",
      "a50ec512bdd98483a7cdcd88cbe11933": "day la tho!",
      "6f97975a0bda0c9219915161149cbb26": "day la may tinh!",
      "1c6636ce086fef9305c6e3d25e37d5b2": "day la bi mat",
      "fce3b93eaa64a8869c47ba25daa8887f": "ngon!",
      "41a18649f969a00ec0b2feba20db997f": "mau nay cool!",
      "756afb811df2eb3ec88a9e98f3dcaa8d": "toi qua gioi"
    }
  }
}

After the sync-out step, the translations are distributed to dashboard/config/locales/placeholder_texts.<locale>.json.
Example of dashboard/config/locales/placeholder_texts.vi-VN.json:

{
  "vi-VN": {
    "data": {
      "placeholder_texts": {
        "courseE_aboutme_1_2020": {
          "b63151482630edd1589a9ac24d107c49": "toi la rikki",
          "810896b14fc6615f0c76628b9b6a727e": "day la ban Thuy!",
          "5d48ecdd8e3baeb99c6a8dcb7faa13dd": "kem rat ngon....",
          "a50ec512bdd98483a7cdcd88cbe11933": "day la tho!",
          "6f97975a0bda0c9219915161149cbb26": "day la may tinh!",
          "1c6636ce086fef9305c6e3d25e37d5b2": "day la bi mat",
          "fce3b93eaa64a8869c47ba25daa8887f": "ngon!",
          "41a18649f969a00ec0b2feba20db997f": "mau nay cool!",
          "756afb811df2eb3ec88a9e98f3dcaa8d": "toi qua gioi"
        }
      }
    }
  }
}
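The sync-out output is keyed by level name rather than by puzzle URL. A sketch of building that nesting follows; `build_locale_payload` is a hypothetical helper, and how the real code maps a puzzle URL to a level name such as courseE_aboutme_1_2020 is not shown in this PR description, so the sketch takes the level name as given.

```ruby
require 'json'

# Hypothetical helper: wraps per-level translations in the
# locale => data => placeholder_texts nesting used by the locale file.
def build_locale_payload(locale, translations_by_level)
  {
    locale => {
      'data' => {
        'placeholder_texts' => translations_by_level
      }
    }
  }
end

payload = build_locale_payload(
  'vi-VN',
  { 'courseE_aboutme_1_2020' => { 'fce3b93eaa64a8869c47ba25daa8887f' => 'ngon!' } }
)
puts JSON.pretty_generate(payload)
```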

Rendering the translations:

English: [Image: Screen Shot 2020-10-02 at 8 06 27 AM]
Vietnamese: [Image: Screen Shot 2020-10-02 at 8 05 02 AM]

Testing story

  • Run bin/i18n/sync-in.rb to extract placeholder strings from the dashboard/config/scripts/levels/courseE_aboutme_1_2020.level file.
  • Manually create a sync-down output at i18n/locales/vi-VN/course_content/2020/coursee-2020.json.
  • Manually create /tmp/codeorg_changes.json, /tmp/codeorg-markdown_changes.json, /tmp/hour-of-code_changes.json with content.
  • Run bin/i18n/sync-out.rb to distribute translations to dashboard/config/locales/placeholder_texts.vi-VN.json.
  • Go to http://localhost-studio.code.org:3000/s/coursee-2020/stage/9/puzzle/1/lang/vi to see the translations in Vietnamese.

Reviewer Checklist:

  • Tests provide adequate coverage
  • Privacy and Security impacts have been assessed
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

@hacodeorg hacodeorg requested review from Hamms and a team September 30, 2020 03:13
@hacodeorg hacodeorg marked this pull request as ready for review September 30, 2020 04:12
next unless text_title&.content =~ /[a-zA-Z]{3,}/

# Use only alphanumeric characters in lower cases as string key
text_key = text_title.content.gsub(/[^a-zA-Z0-9_ ]/, '').split.join('_').downcase
Contributor

I don't love the idea of inferring an identifier from the content of the string. When we've done things like this in the past, it ends up causing problems when the content changes and strings unexpectedly go missing, or when similar content is used in multiple places and the mapping ends up being non-unique.

Is there anything else we could use as a unique identifier here?
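Both concerns can be reproduced with the key derivation from the diff. In the sketch below, `content_key` is a hypothetical wrapper around that line, and the sample strings are made up for illustration.

```ruby
# Key derivation copied from the diff under review, wrapped in a
# hypothetical helper so it can be called on sample strings.
def content_key(text)
  text.gsub(/[^a-zA-Z0-9_ ]/, '').split.join('_').downcase
end

# Similar content maps to the same key (non-unique mapping).
puts content_key('Yummy!')     # yummy
puts content_key('yummy?')     # yummy

# Editing the content changes the key, so a translation stored
# under the old key is no longer found.
puts content_key('So yummy!')  # so_yummy
```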

Contributor Author

How about using an MD5 hash? It would keep a 1:1 relationship between an ID and a string.
Another option is to use a combination of script id, level id and string position, such as script_11_level_399_str_1.
Did we use any of the above options in the past?
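The two options could look like the sketch below. Both helper names are hypothetical; the positional format follows the script_11_level_399_str_1 example, and the argument names are assumptions.

```ruby
require 'digest'

# Option 1: key derived from the string content.
def md5_key(text)
  Digest::MD5.hexdigest(text)
end

# Option 2: key derived from where the string lives
# (script id, level id, position of the string within the level).
def position_key(script_id, level_id, position)
  "script_#{script_id}_level_#{level_id}_str_#{position}"
end

original = 'Yummy!'
edited = 'Yummy!!'

# Editing the text changes the MD5 key, so the comparison prints false.
puts md5_key(original) == md5_key(edited)

# The positional key ignores content, but it shifts if strings are reordered.
puts position_key(11, 399, 1)
```

The trade-off is that the MD5 key survives reordering but not edits, while the positional key survives edits but not reordering.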

Contributor Author

We use string contents as IDs for function_definitions and behavior_names. Is that because those strings are usually short and contain only alphabetic characters?

i18n_strings['function_definitions'][name.content] = function_definition

i18n_strings['behavior_names'][name.content] = name.content if name

Contributor

It's because we were also unable to find a better option there. 🙃 Like I said, we've done this in the past but it's ended up being more fragile than we'd like.

An MD5 hash does address the issues of potential collisions, but we're still ending up with an identifier that's dependent on the content, rather than an identifier that can consistently identify content as it changes. That might be too much to ask for, though.

I'd love to see at least a mockup of the other end of this functionality; the code that's responsible for finding a translation given a block. I think that'll give us a better sense of which direction is best to go here.

Contributor Author

The sync-out and rendering pieces of this functionality are shorter than I thought, so I added them to this PR.

The rendering piece still uses string content as ID for now, just so we can verify it can render translations correctly.

@hacodeorg hacodeorg changed the title Extract placeholder texts from .level files Extract and translate placeholder texts Oct 2, 2020
@hacodeorg
Contributor Author

Elijah and I discussed this PR further on Slack and decided to go with an MD5-key solution for now. We will explore a generalizable way to easily add unique, reproducible identifiers to XML (in this case, .level files).
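With MD5 keys, the rendering side could find a translation along these lines. This is a sketch only: `localized_placeholder` is a hypothetical name, and falling back to the original text when no translation exists is an assumption, not something this PR states.

```ruby
require 'digest'

# Hypothetical lookup: returns the translation stored under the MD5 of the
# original text, or the original text when none is found (assumed fallback).
def localized_placeholder(original, level_translations)
  key = Digest::MD5.hexdigest(original)
  level_translations.fetch(key, original)
end

translations = { Digest::MD5.hexdigest('Yummy!') => 'ngon!' }
puts localized_placeholder('Yummy!', translations)  # ngon!
puts localized_placeholder('Hello!', translations)  # Hello!
```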

Contributor

@Hamms Hamms left a comment

I'd love to see a test here! Specifically a test of the localized_text_blocks functionality in https://github.com/code-dot-org/code-dot-org/blob/staging/dashboard/test/models/blockly_test.rb

Otherwise, this looks great! Thanks for taking the time to dig into some options here

@hacodeorg
Contributor Author

Thank you for the pointer. Test added.

@hacodeorg hacodeorg merged commit 8068b04 into staging Oct 7, 2020
@hacodeorg hacodeorg deleted the ha/placeholder-text-sync-in branch October 7, 2020 17:10