Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(compiler): add i18nPreserveWhitespaceForLegacyExtraction #56507

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dgp1130
Copy link
Contributor

@dgp1130 dgp1130 commented Jun 18, 2024

This configures whether or not to preserve whitespace content when extracting messages from Angular templates in the legacy (View Engine) extraction pipeline. It defaults to true to maintain backwards compatibility, however false is the ideal behavior since it makes translations less sensitive to unimportant whitespace changes.

The change to i18n_parser.ts is really a bug fix. The localization process requires that the number of tokens in a message stays constant between i18n passes so it can reuse source spans. Therefore even empty text tokens need to be retained in order to maintain that invariant. However this bug fix does affect message IDs, and changing that could invalidate existing translations, therefore it is also gated by the same flag to maintain backwards compatibility.

There are two other issues I discovered while adding this flag. First, Angular template expressions are factored in to the message ID calculation. Meaning changing from Hello, {{ name }} to Hello, {{ firstName }} will trigger retranslatation. That almost sounds correct, except that all expressions are replaced with something like {INTERPOLATION} unless // ph(name = "...") is used, so changing the expression would have no visible effect to translators anyways. Therefore, disabling i18nPreserveWhitespaceForLegacyExtraction removes placeholders from the calculation.

Second, ICU templates are broken up into multiple messages based on whitespace. So <div i18n>{...}</div> and <div i18n> {...} </div> generate one and two $localize messages apiece. Changing this would require updating the generated runtime code, which is out of scope for the time being here. Also reformatting ICUs in certain situations like {apples, plural, =other {Many apples.}} -> {apples, plural, =other {Many\napples.}} still invalidates message IDs. It really shouldn't, but again, this is out of scope for the time being. Instead I'm just leaving ICUs as a known limitation of this flag with only basic support.

@dgp1130 dgp1130 added the target: patch This PR is targeted for the next patch release label Jun 18, 2024
@angular-robot angular-robot bot added the area: compiler Issues related to `ngc`, Angular's template compiler label Jun 18, 2024
@ngbot ngbot bot added this to the Backlog milestone Jun 18, 2024
@dgp1130 dgp1130 force-pushed the whitespace-insensitive branch 6 times, most recently from bc7d69d to 1b3c1c4 Compare June 25, 2024 18:34
@dgp1130 dgp1130 marked this pull request as ready for review June 25, 2024 18:48
@dgp1130 dgp1130 force-pushed the whitespace-insensitive branch 2 times, most recently from 231111a to ac378b8 Compare June 25, 2024 20:33
@dgp1130 dgp1130 force-pushed the whitespace-insensitive branch 2 times, most recently from bde716c to 9a03487 Compare July 12, 2024 18:31
Copy link
Contributor

@AndrewKushnir AndrewKushnir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgp1130 looks great, thanks for the updates 👍 I've left a few minor comments, but none of them are blocking.

As a followup (in a separate PR), it might be great to put together a markdown file (e.g. packages/compiler/src/i18n/EXTRACTION.md) and describe how the extraction process depends on whitespaces in messages and how different subsystems are configured to deal with it (e.g. ViewEngine-based extraction, code generation with goog.getMsg, etc). You have this information already in different comments in the code (which is very helpful!) and a document may help provide an overall terminology (e.g. what a "significant whitespace" means), how this works from high-level and we can also use the doc as a reference in comments to the code that you've added. This document will help us in the future to restore the context once we decide to perform further changes to this code.

Thank you.

Copy link
Contributor

@thePunderWoman thePunderWoman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed-for: public-api

@dgp1130 dgp1130 added the action: cleanup The PR is in need of cleanup, either due to needing a rebase or in response to comments from reviews label Jul 18, 2024
@dgp1130 dgp1130 force-pushed the whitespace-insensitive branch 4 times, most recently from 55af290 to c7e163c Compare July 18, 2024 18:44
@dgp1130 dgp1130 force-pushed the whitespace-insensitive branch 2 times, most recently from 7ae978f to 4742312 Compare July 18, 2024 19:40
@dgp1130
Copy link
Contributor Author

dgp1130 commented Jul 19, 2024

@dgp1130 dgp1130 added action: review The PR is still awaiting reviews from at least one requested reviewer and removed action: cleanup The PR is in need of cleanup, either due to needing a rebase or in response to comments from reviews labels Jul 20, 2024
@dgp1130 dgp1130 force-pushed the whitespace-insensitive branch 2 times, most recently from a5fdc4a to e43049a Compare July 22, 2024 20:22
This configures whether or not to preserve whitespace content when extracting messages from Angular templates in the legacy (View Engine) extraction pipeline.

The change to `i18n_parser.ts` is really a bug fix. The localization process requires that the nubmer of tokens in a message stays constant between i18n passes so it can reuse source spans. Therefore even empty text tokens need to be retained in order to maintain that invariant. However this bug fix does affect message IDs, and changing that could invalidate existing translations, therefore it is also gated by the same flag to maintain backwards compatibility.
…pace` and `preserveWhitespaces` are used together correctly

This bans the case where `preserveSignificantWhitespace` is disabled but `preserveWhitespace` is enabled. Such a case would imply retaining all whitespace but also dropping significant whitespace which is not possible.
…Visitor`

When disabling `i18nPreserveSignificantWhitespaceForLegacyExtraction` I was looking at a test case with ICU messages containing leading and trailing whitespace:

```angular
<div i18n>
  {apples, plural, =other {I have many apples.}}
</div>
```

This would historically generate two messages:

```javascript
const MSG_TMP = goog.getMsg('{apples, plural, =other {I have many apples.}}');
const MSG_FOO = goog.getMsg(' {$ICU} ', { 'ICU': MSG_TMP });
```

But I found that I was getting just one message:

```javascript
const MSG_TMP = goog.getMsg(' {apples, plural, =other {I have many apples.}} ');
```

This is arguably an improvement, but changed the messages and message IDs, which isn't desirable with this option. I eventually traced this back to the `isIcu` initialization in [`i18n_parser.ts`](/packages/compiler/src/i18n/i18n_parser.ts):

```typescript
const context: I18nMessageVisitorContext = {
  isIcu: nodes.length == 1 && nodes[0] instanceof html.Expansion,
  // ...
};
```

[`_I18nVisitor.prototype.visitExpansion`](/packages/compiler/src/i18n/i18n_parser.ts) uses this to decide whether or not to generate a sub-message for a given ICU expansion:

```typescript
if (context.isIcu || context.icuDepth > 0) {
  // Returns an ICU node when:
  // - the message (vs a part of the message) is an ICU message, or
  // - the ICU message is nested.
  const expPh = context.placeholderRegistry.getUniquePlaceholder(`VAR_${icu.type}`);
  i18nIcu.expressionPlaceholder = expPh;
  context.placeholderToContent[expPh] = {
    text: icu.switchValue,
    sourceSpan: icu.switchValueSourceSpan,
  };
  return context.visitNodeFn(icu, i18nIcu);
}

// Else returns a placeholder
// ICU placeholders should not be replaced with their original content but with the their
// translations.
// TODO(vicb): add a html.Node -> i18n.Message cache to avoid having to re-create the msg
const phName = context.placeholderRegistry.getPlaceholderName('ICU', icu.sourceSpan.toString());
context.placeholderToMessage[phName] = this.toI18nMessage([icu], '', '', '', undefined);
const node = new i18n.IcuPlaceholder(i18nIcu, phName, icu.sourceSpan);
return context.visitNodeFn(icu, node);
```

Note that `isIcu` is the key condition between these two cases and depends on whether or not the ICU expansion has any siblings. The introduction of `WhitespaceVisitor` to `I18nMetaVisitor` trims insignificant whitespace, including empty text nodes not adjacent to an ICU expansion (from [`WhitespaceVisitor.prototype.visitText`](/packages/compiler/src/ml_parser/html_whitespaces.ts)):

```typescript
const isNotBlank = text.value.match(NO_WS_REGEXP);
const hasExpansionSibling =
  context && (context.prev instanceof html.Expansion || context.next instanceof html.Expansion);

if (isNotBlank || hasExpansionSibling) {
  // Transform node by trimming it...
  return trimmedNode;
}

return null; // Drop node which is empty and has no ICU expansion sibling.
```

`hasExpansionSibling` was intended to retain empty text nodes leading or trailing an ICU expansion, however `context` was `undefined`, so this check failed and the leading / trailing text nodes were dropped. This resulted in trimming the ICU text by dropping the leading / trailing whitespace nodes. Having only a single ICU expansion with no leading / trailing text nodes caused `_I18nVisitor` to initialize `isIcu` incorrectly and caused it to generate one message instead of two.

`WhitespaceVisitor` is supposed to get this context from `visitAllWithSiblings`. So the fix here is to make sure `WhitespaceVisitor` is always visited via this function which provides the required context. I updated all usage sites to make sure this context is use consistently and implemented the `WhitespaceVisitor.prototype.visit` method to throw when the context is missing to make sure we don't encounter a similar mistake in the future.

Unfortunately this broke one compliance test. Specifically the [`icu_logic/icu_only.js`](/home/douglasparker/Source/ng/packages/compiler-cli/test/compliance/test_cases/r3_view_compiler_i18n/icu_logic/icu_only.js) test which changed from generating:

```javascript
function MyComponent_Template(rf, ctx) {
  if (rf & 1) {
    i0.ɵɵi18n(0, 0);
  }
  // ...
}
```

To now generating:

```javascript
function MyComponent_Template(rf, ctx) {
  if (rf & 1) {
    i0.ɵɵtext(0, " ");
    i0.ɵɵi18n(1, 0);
    i0.ɵɵtext(2, "\n");
  }
  // ...
}
```

This test uses the default value `preserveWhitespaces: false` (`i18nPreserveSignificantWhitespaceForLegacyExtraction` should not affect compiled JS output, we already retain significant whitespace there). So what this indicates to me is that ICU logic is already broken because it's not preserving significant whitespace in this case. My change is probably a bug fix, but one which would affect the compiled runtime, which is not in scope here. The root cause is because using `visitAllWithSiblings` everywhere means the context is retained correctly in this case and the whitespace is leading/trailing an ICU message, therefore it is retained per the logic of `WhitespaceVisitor.prototype.visitText` I mentioned eariler.

To address this, I left one usage of `WhitespaceVisitor` using `html.visitAll` instead of `visitAllWithSiblings` to retain this bug. I has to lossen the assertion I put in `WhitespaceVisitor.prototype.visit` to make this possible, but it should still throw by default when misused, which is the important part.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action: review The PR is still awaiting reviews from at least one requested reviewer area: compiler Issues related to `ngc`, Angular's template compiler target: patch This PR is targeted for the next patch release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants