Guide on multi-mode lexing #132

Open
wants to merge 3 commits into
base: main
175 changes: 175 additions & 0 deletions hugo/content/guides/multi-mode-lexing.md
@@ -0,0 +1,175 @@
---
title: "Multi-Mode Lexing"
Contributor

I would recommend changing the title here to something about template literals, possibly

Template Literals with Multi-Mode Lexing

weight: 400
---

Many modern programming languages such as [JavaScript](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals) or [C#](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/tokens/interpolated) support template literals.
They are a way to easily concatenate or interpolate string values while maintaining great code readability.
This guide will show you how to support template literals in Langium.
Contributor

Suggested change
This guide will show you how to support template literals in Langium.
This guide will show you how to support template literals in Langium through multi-mode lexing.

This paragraph is still a bit strange, as it reads more like the topic is template literals.


For this specific example, our template literal starts and ends using backticks `` ` `` and are interupted by expressions that are wrapped in curly braces `{}`.
Contributor

Suggested change
For this specific example, our template literal starts and ends using backticks `` ` `` and are interupted by expressions that are wrapped in curly braces `{}`.
For this specific example, our template literal starts and ends with backticks `` ` ``, and is interrupted by expressions that are wrapped in curly braces `{}`.

So in our example, usage of template literals might look something like this:

```js
println(`hello {name}!`);
```

Conceptually, template literals work by reading a start terminal, which starts with `` ` `` and ends with `{`,
followed by an expression and then an end terminal which is effectively just the start terminal in reverse using `}` and `` ` ``.
Contributor

Suggested change
followed by an expression and then an end terminal which is effectively just the start terminal in reverse using `}` and `` ` ``.
followed by an expression and an end terminal, which is `}` and `` ` ``.

Since we don't want to restrict users to only a single expression in their template literals, we also need a "middle" terminal reading from `}` to `{`.
Of course, a user might also write a template literal without any expressions in it.
So we additionally need a "full" terminal that reads from the start of the literal all the way to the end in one go.

To achieve this, we will define a `TemplateLiteral` parser rule and a few terminals.
These terminals will adhere to the requirements that we just defined.
To make it a bit easier to read and maintain, we also define a special terminal fragment that we can reuse in all our terminal definitions:

```antlr
TemplateLiteral:
    // Either just the full content
    content+=TEMPLATE_LITERAL_FULL |
    // Or template literal parts with expressions in between
    (
        content+=TEMPLATE_LITERAL_START
        content+=Expression?
        (
            content+=TEMPLATE_LITERAL_MIDDLE
            content+=Expression?
        )*
        content+=TEMPLATE_LITERAL_END
    )
;

terminal TEMPLATE_LITERAL_FULL:
    '`' IN_TEMPLATE_LITERAL* '`';

terminal TEMPLATE_LITERAL_START:
    '`' IN_TEMPLATE_LITERAL* '{';

terminal TEMPLATE_LITERAL_MIDDLE:
    '}' IN_TEMPLATE_LITERAL* '{';

terminal TEMPLATE_LITERAL_END:
    '}' IN_TEMPLATE_LITERAL* '`';

// '{{' is handled in a special way so we can escape normal '{' characters
// '``' is doing the same for the '`' character
terminal fragment IN_TEMPLATE_LITERAL:
    /[^{`]|{{|``/;
```
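
To get a feeling for what each terminal consumes, the terminals can be approximated with plain JavaScript regular expressions. This is only a sketch for experimenting outside of Langium: the `terminals` table and the `matchTerminal` helper are made up for this illustration, and the patterns are hand-translated from the grammar above rather than generated by Langium.

```ts
// Shared fragment: any character except '{' and '`', or one of the escapes '{{' and '``'
const IN_TEMPLATE_LITERAL = '(?:[^{`]|\\{\\{|``)';

type TerminalName =
    | 'TEMPLATE_LITERAL_FULL'
    | 'TEMPLATE_LITERAL_START'
    | 'TEMPLATE_LITERAL_MIDDLE'
    | 'TEMPLATE_LITERAL_END';

// Each pattern is anchored with '^' so that it only matches at the start of the given input
const terminals: Record<TerminalName, RegExp> = {
    TEMPLATE_LITERAL_FULL: new RegExp('^`' + IN_TEMPLATE_LITERAL + '*`'),
    TEMPLATE_LITERAL_START: new RegExp('^`' + IN_TEMPLATE_LITERAL + '*\\{'),
    TEMPLATE_LITERAL_MIDDLE: new RegExp('^\\}' + IN_TEMPLATE_LITERAL + '*\\{'),
    TEMPLATE_LITERAL_END: new RegExp('^\\}' + IN_TEMPLATE_LITERAL + '*`')
};

// Hypothetical helper: returns the text a terminal would consume, or undefined if it doesn't match
function matchTerminal(name: TerminalName, input: string): string | undefined {
    const match = terminals[name].exec(input);
    return match ? match[0] : undefined;
}

matchTerminal('TEMPLATE_LITERAL_START', '`hello {name}!`'); // '`hello {'
matchTerminal('TEMPLATE_LITERAL_END', '}!`');               // '}!`'
matchTerminal('TEMPLATE_LITERAL_FULL', '`hello {name}!`');  // undefined, the literal contains an expression
matchTerminal('TEMPLATE_LITERAL_FULL', '`a {{ b`');         // '`a {{ b`', '{{' is an escaped brace
```

Note that the expression `name` itself is not covered by any of these terminals; it is lexed and parsed by the regular rules of the language.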

If we go ahead and start parsing files with these changes, most things should work as expected.
However, depending on the structure of your existing grammar, some of these new terminals might be in conflict with existing terminals of your language.
For example, if your language supports block statements, chaining multiple blocks together will make this issue apparent:

```js
{
    console.log('hi');
}
{
    console.log('hello');
}
```

The `} ... {` block in this example won't be parsed as separate `}` and `{` tokens, but instead as a single `TEMPLATE_LITERAL_MIDDLE` token, resulting in a parser error due to the unexpected token.
This doesn't make a lot of sense, since we aren't in the middle of a template literal at this point anyway.
However, our lexer doesn't know yet that the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals are only allowed to show up within a `TemplateLiteral` rule.
To rectify this, we will need to make use of lexer modes. They will give us the necessary context to know whether we're inside a template literal or outside of it.
Depending on the current selected mode, we can lex different terminals. In our case, we want to exclude the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals.
Contributor

Suggested change
Depending on the current selected mode, we can lex different terminals. In our case, we want to exclude the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals.
Depending on the current selected mode, we can lex different terminals. In our case, we want to exclude the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals unless we've recently parsed a `TEMPLATE_LITERAL_START` terminal.
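
The conflict can be reproduced without Langium. The sketch below hand-translates `TEMPLATE_LITERAL_MIDDLE` into a JavaScript regular expression and searches the block example with it; the `findAccidentalMiddle` helper is a made-up name for this illustration.

```ts
// Hand-translated TEMPLATE_LITERAL_MIDDLE: '}' IN_TEMPLATE_LITERAL* '{'
const MIDDLE_PATTERN = new RegExp('\\}(?:[^{`]|\\{\\{|``)*\\{');

// Hypothetical helper: returns the first stretch of text that looks like a "middle" terminal
function findAccidentalMiddle(source: string): string | undefined {
    const match = MIDDLE_PATTERN.exec(source);
    return match ? match[0] : undefined;
}

const blocks = "{\n    console.log('hi');\n}\n{\n    console.log('hello');\n}";
findAccidentalMiddle(blocks); // '}\n{', the end of the first block and the start of the second
```

Nothing in the pattern itself ties it to a preceding `TEMPLATE_LITERAL_START`, which is why the lexer needs additional context.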


The following implementation of a `TokenBuilder` will do the job for us. It creates two lexing modes, which are almost identical except for the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals.
Contributor

There should be at least another sentence devoted to what the TokenBuilder is, and another one that describes what the 2 lexing modes are in this context.

We will also need to make sure that the modes are switched based on the `TEMPLATE_LITERAL_START` and `TEMPLATE_LITERAL_END` terminals. We use `PUSH_MODE` and `POP_MODE` for this.
Contributor

@montymxb montymxb Feb 17, 2023

We should explain why we need push & pop modes here, and probably a quick detail that there's a lexer mode stack underneath the hood. Even a single sentence would help to keep context.
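
To make the idea of a mode stack more concrete, here is a toy model of a multi-mode lexer in plain TypeScript. It is a sketch under simplifying assumptions and not how Chevrotain is implemented: the `TokenDef` shape, the `tokenize` function and the reduced `ID` and `WS` tokens are all invented for this illustration. The lexer only tries the tokens of the mode on top of the stack, pushes the template mode after a start terminal and pops it again after an end terminal.

```ts
type Mode = 'regular' | 'template';

// Hypothetical, simplified token definition (not Chevrotain's TokenType)
interface TokenDef {
    name: string;
    pattern: RegExp;  // sticky, so it only matches at the current offset
    push?: Mode;      // enter this mode after the token was matched
    pop?: boolean;    // return to the previous mode after the token was matched
    skip?: boolean;   // matched, but not emitted (whitespace)
}

const IN = '(?:[^{`]|\\{\\{|``)*';
const sticky = (source: string): RegExp => new RegExp(source, 'y');

const FULL: TokenDef = { name: 'TEMPLATE_LITERAL_FULL', pattern: sticky('`' + IN + '`') };
const START: TokenDef = { name: 'TEMPLATE_LITERAL_START', pattern: sticky('`' + IN + '\\{'), push: 'template' };
const MIDDLE: TokenDef = { name: 'TEMPLATE_LITERAL_MIDDLE', pattern: sticky('\\}' + IN + '\\{') };
const END: TokenDef = { name: 'TEMPLATE_LITERAL_END', pattern: sticky('\\}' + IN + '`'), pop: true };
const LBRACE: TokenDef = { name: '{', pattern: sticky('\\{') };
const RBRACE: TokenDef = { name: '}', pattern: sticky('\\}') };
const ID: TokenDef = { name: 'ID', pattern: sticky('[A-Za-z_]\\w*') };
const WS: TokenDef = { name: 'WS', pattern: sticky('\\s+'), skip: true };

// Regular mode has no MIDDLE/END, template mode has no '}' keyword
const modes: Record<Mode, TokenDef[]> = {
    regular: [FULL, START, LBRACE, RBRACE, ID, WS],
    template: [FULL, START, MIDDLE, END, LBRACE, ID, WS]
};

function tokenize(text: string): string[] {
    const stack: Mode[] = ['regular'];
    const names: string[] = [];
    let offset = 0;
    while (offset < text.length) {
        const mode = stack[stack.length - 1] ?? 'regular';
        let matched: TokenDef | undefined;
        // The first token of the current mode that matches at the offset wins
        for (const def of modes[mode]) {
            def.pattern.lastIndex = offset;
            const result = def.pattern.exec(text);
            if (result) {
                matched = def;
                offset += result[0].length;
                break;
            }
        }
        if (!matched) {
            throw new Error(`Unexpected character at offset ${offset}`);
        }
        if (!matched.skip) names.push(matched.name);
        if (matched.push) stack.push(matched.push);
        if (matched.pop && stack.length > 1) stack.pop();
    }
    return names;
}

tokenize('`hello {name}!`'); // ['TEMPLATE_LITERAL_START', 'ID', 'TEMPLATE_LITERAL_END']
tokenize('{ a }\n{ b }');    // ['{', 'ID', '}', '{', 'ID', '}']
```

Because the modes live on a stack rather than in a single flag, a template literal nested inside an expression of another template literal pushes a second entry, and each end terminal returns to whatever mode was active before.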


Contributor

@montymxb montymxb Feb 17, 2023

I would recommend splitting this up into 3 separate parts. Building a custom token builder is non-trivial, and it would help to explain the steps a bit more. I've written a few suggestions below for splits (heads-up, some comments below appear to be out of order with regards to line position).

Contributor

As a step up from this, I still feel we should split this up. But in the interest of moving this along after some time can we instead make an issue for a custom token builder guide separately?

```ts
import { DefaultTokenBuilder, isTokenTypeArray, GrammarAST } from "langium";
import { IMultiModeLexerDefinition, TokenType, TokenVocabulary } from "chevrotain";

const REGULAR_MODE = 'regular_mode';
const TEMPLATE_MODE = 'template_mode';

export class CustomTokenBuilder extends DefaultTokenBuilder {

    // Review comment (Contributor, @montymxb, Feb 17, 2023): From before, I would first break this out
    // into a separate paragraph, explaining we need to first build up a multi-mode lexer definition
    // that has various modes, which are pushed on by our special tokens.
    // Review comment (Contributor): See above.
    override buildTokens(grammar: GrammarAST.Grammar, options?: { caseInsensitive?: boolean }): TokenVocabulary {
        const tokenTypes = super.buildTokens(grammar, options);

        if (isTokenTypeArray(tokenTypes)) {
            // Regular mode just drops template literal middle & end
            const regularModeTokens = tokenTypes
                .filter(token => !['TEMPLATE_LITERAL_MIDDLE', 'TEMPLATE_LITERAL_END'].includes(token.name));
            // Template mode needs to exclude the '}' keyword
            const templateModeTokens = tokenTypes
                .filter(token => !['}'].includes(token.name));

            const multiModeLexerDef: IMultiModeLexerDefinition = {
                modes: {
                    [REGULAR_MODE]: regularModeTokens,
                    [TEMPLATE_MODE]: templateModeTokens
                },
                defaultMode: REGULAR_MODE
            };
            return multiModeLexerDef;
        } else {
            throw new Error('Invalid token vocabulary received from DefaultTokenBuilder!');
        }
    }

    // Review comment (Contributor): This would make a nice second part, indicating we need to clean up
    // our } token so regular mode doesn't get messed up.
    // Review comment (Contributor): See above.
    protected override buildKeywordToken(
        keyword: GrammarAST.Keyword,
        terminalTokens: TokenType[],
        caseInsensitive: boolean
    ): TokenType {
        const tokenType = super.buildKeywordToken(keyword, terminalTokens, caseInsensitive);

        if (tokenType.name === '}') {
            // The default } token will use [TEMPLATE_LITERAL_MIDDLE, TEMPLATE_LITERAL_END] as longer alts
            // We need to delete the LONGER_ALT, they are not valid for the regular lexer mode
            delete tokenType.LONGER_ALT;
        }
        return tokenType;
    }

    // Review comment (Contributor, @montymxb, Feb 17, 2023): Third part, we can add this & explain how
    // we're associating a push/pop action for start/end literals (which chevrotain needs).
    // Review comment (Contributor): See above.
    protected override buildTerminalToken(terminal: GrammarAST.TerminalRule): TokenType {
        const tokenType = super.buildTerminalToken(terminal);

        // Update token types to enter & exit template mode
        if (tokenType.name === 'TEMPLATE_LITERAL_START') {
            tokenType.PUSH_MODE = TEMPLATE_MODE;
        } else if (tokenType.name === 'TEMPLATE_LITERAL_END') {
            tokenType.POP_MODE = true;
        }
        return tokenType;
    }
}
```

With this change in place, the parser will work as expected. There is one last issue which we need to resolve in order to get everything working perfectly.
Contributor

Suggested change
With this change in place, the parser will work as expected. There is one last issue which we need to resolve in order to get everything working perfectly.
With this change in place, the parser should work as expected (depending on your grammar & implementation). However, we still need to clean up leftover artifacts from our start & end sequences.

When inspecting our AST, the `TemplateLiteral` object will contain strings with input artifacts, mainly `` ` ``, `{` and `}`.
These aren't actually part of the semantic value of these strings, so we should get rid of them.
We will need to create a custom `ValueConverter` and remove these artifacts:

```ts
import { CstNode, GrammarAST, DefaultValueConverter, ValueType, convertString } from 'langium';

export class CustomValueConverter extends DefaultValueConverter {

    protected override runConverter(rule: GrammarAST.AbstractRule, input: string, cstNode: CstNode): ValueType {
        if (rule.name.startsWith('TEMPLATE_LITERAL')) {
            // 'convertString' simply removes the first and last character of the input
            return convertString(input);
        } else {
            return super.runConverter(rule, input, cstNode);
        }
    }
}
```
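
To see what the converter is expected to produce for each terminal, the following standalone sketch mimics the behaviour described in the code comment above, namely dropping the first and last character of the token text. Both helpers are hypothetical and not part of Langium. In particular, `unescapeTemplateText` goes beyond what the converter above does: it shows one possible way to resolve the doubled brace and doubled backtick escapes from the grammar, assuming you want them collapsed in the AST.

```ts
// Hypothetical stand-in for the delimiter removal: drops the first and last character
function stripDelimiters(tokenText: string): string {
    return tokenText.substring(1, tokenText.length - 1);
}

// Optional extra step (an assumption, not part of the converter above):
// collapse the escape sequences '{{' and '``' into single characters
function unescapeTemplateText(text: string): string {
    return text.replace(/\{\{/g, '{').replace(/``/g, '`');
}

stripDelimiters('`hello {');   // 'hello '  (TEMPLATE_LITERAL_START)
stripDelimiters('}!`');        // '!'       (TEMPLATE_LITERAL_END)
stripDelimiters('`plain`');    // 'plain'   (TEMPLATE_LITERAL_FULL)
unescapeTemplateText(stripDelimiters('`a {{ b`')); // 'a { b'
```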

Of course, let's not forget to bind all of these services:
Contributor

Suggested change
Of course, let's not forget to bind all of these services:
Of course, let's not forget to bind all of these services in your **module.ts**:


```ts
export const CustomModule = {
    parser: {
        TokenBuilder: () => new CustomTokenBuilder(),
        ValueConverter: () => new CustomValueConverter()
    },
};
```