In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css : string = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Converting HTML to Text

This notebook demonstrates how to build a simple but effective HTML-to-text converter using **TypeScript** and the parsing library [`lezer`](https://lezer.codemirror.net/).

## The Goal

Our objective is to extract the readable, plain text content from a given HTML document. We will use the HTML source code from the homepage of [Prof. Dr. Karl Stroetmann](http://wwwlehre.dhbw-stuttgart.de/~stroetma/) as our example data. 

To achieve this, we define a Lezer grammar, written as a TypeScript string, that distinguishes between different HTML sections such as `<head>`, `<script>`, and normal text.

First, let's load our example HTML data:

In [None]:
const data : string = `
<html>
  <head>
    <meta charset="utf-8">
    <title>Homepage of Prof. Dr. Karl Stroetmann</title>
    <link type="text/css" rel="stylesheet" href="style.css" />
    <link href="http://fonts.googleapis.com/css?family=Rochester&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Pacifico&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Cabin+Sketch&subset=latin,latin-ext" rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Sacramento" rel="stylesheet" type="text/css">
  </head>
  <body>
    <hr/>

    <div id="table">
      <header>
        <h1 id="name">Prof. Dr. Karl Stroetmann</h1>
      </header>

      <div id="row1">
        <div class="right">
          <a id="dhbw" href="http://www.ba-stuttgart.de">Duale Hochschule Baden-W&uuml;rttemberg</a>
          <br/>Coblitzallee 1-9
          <br/>68163 Mannheim
          <br/>Germany
	  <br>
          <br/>Office: &nbsp;&nbsp;&nbsp; Raum 344B
          <br/>Phone:&nbsp;&nbsp;&nbsp; +49 621 4105-1376
          <br/>Fax:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; +49 621 4105-1194
          <br/>Skype: &nbsp;&nbsp;&nbsp; karlstroetmann
        </div>  


        <div id="links">
          <strong class="some">Some links:</strong>
          <ul class="inlink">
            <li class="inlink">
	      My <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">lecture notes</a>,
              as well as the programs presented in class, can be found
              at <br>
              <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">https://github.com/karlstroetmann</a>.
              
            </li>
            <li class="inlink">Most of my papers can be found at <a class="inlink" href="https://www.researchgate.net/">researchgate.net</a>.</li>
            <li class="inlink">The programming language SetlX can be downloaded at <br>
              <a href="http://randoom.org/Software/SetlX"><tt class="inlink">http://randoom.org/Software/SetlX</tt></a>.
            </li>
          </ul>
        </div>
      </div>
    </div>
    
    <div id="intro">
      As I am getting old and wise, I have to accept the limits of
      my own capabilities.  I have condensed these deep philosophical
      insights into a most beautiful pearl of poetry.  I would like 
      to share these humble words of wisdom:
      
      <div class="poetry">
        I am a teacher by profession,    <br>
        mostly really by obsession;      <br>
        But even though I boldly try,    <br>
        I just cannot teach <a href="flying-pig.jpg" id="fp">pigs</a> to fly.</br>
        Instead, I slaughter them and fry.
      </div>
      
      <div class="citation">
        <div class="quote">
          Any sufficiently advanced poetry is indistinguishable from divine wisdom.
        </div>
        <div id="sign">His holiness Pope Hugo &#8555;.</div>
      </div>
    </div>
</div>

</body>
</html>
`;

In [None]:
display.html(data)

The original web page is still available at https://wwwlehre.dhbw-stuttgart.de/~stroetma/.

## Imports

Before we can build our HTML lexer, we need to install and import the necessary packages.

In [None]:
import { buildParser } from '@lezer/generator';
import { Tree, TreeCursor } from '@lezer/common';
import { LRParser } from '@lezer/lr';

## Defining the Grammar

In Lezer, we define tokens and the document structure in a declarative grammar string.

## Token Definitions

We need to define tokens for:

- **Tags**: HTML tags like `<br/>`, `<div>`, `</div>`, which should generally be ignored.
- **Named Entities**: Special characters like `&amp;` or `&uuml;`.
- **Unicode Entities**: Numeric references like `&#8594;`.
- **Content**: Regular text.
- **Linebreaks**: To maintain paragraph formatting.

Special handling is required for `<head>` and `<script>` blocks, as their content should typically be excluded from the plain text output.

### 1. Document Structure

First, we define the entry point (`@top`). A document consists of a sequence of various elements: blocks to ignore (Head, Script), structural elements (Tags, Linebreaks), and actual content (Text, Entities).

In [None]:
const entryPoint = `
  @top Document { (HeadBlock | ScriptBlock | Tag | Entity | Unicode | Linebreak | Content | any)* }
`;

### 2. Basic Token Definitions

We start the `@tokens` block. Here we define the basic building blocks of HTML: tags and line breaks.

* **`Tag`**: Matches standard HTML tags starting with `<` and ending with `>`.
* **`Linebreak`**: Captures newlines to preserve formatting.

In [None]:
const basicTokens = `
  @tokens {
    Tag { "<" ![>]* ">" }
    Linebreak { $[\\n\\r]+ }
`;

### 3. Entity Definitions
Next, we define special character entities. We distinguish between named entities (like `&amp;`) and numeric unicode entities (like `&#8594;`).

In [None]:
const entityTokens = `
    Entity { "&" $[a-zA-Z]+ ";" }
    Unicode { "&#" $[0-9]+ ";" }
`;
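
As a rough cross-check outside of Lezer, the two rules above can be mirrored with ordinary JavaScript regular expressions. This is only an illustrative sketch: the names `namedEntityRe`, `numericEntityRe`, and `classifyReference` are hypothetical and not part of the grammar.

```typescript
// Regex counterparts of the two Lezer token rules (illustrative only).
const namedEntityRe: RegExp = /^&[a-zA-Z]+;$/;    // mirrors: "&" $[a-zA-Z]+ ";"
const numericEntityRe: RegExp = /^&#[0-9]+;$/;    // mirrors: "&#" $[0-9]+ ";"

type ReferenceKind = "Entity" | "Unicode" | "none";

// Classify a candidate string the way the two token rules would.
function classifyReference(text: string): ReferenceKind {
    if (numericEntityRe.test(text)) return "Unicode";
    if (namedEntityRe.test(text)) return "Entity";
    return "none";
}

const referenceKinds: ReferenceKind[] =
    ["&uuml;", "&#8555;", "&nbsp", "&#x216B;"].map(classifyReference);
// referenceKinds is ["Entity", "Unicode", "none", "none"]
```

The last two inputs show the limits of the rules as written: a reference without the closing `;` and a hexadecimal reference such as `&#x216B;` are matched by neither pattern.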

### 4. Special Block Definitions

This is the most complex part. We need to define blocks for `<head>` and `<script>` tags so that we can ignore their *entire* content (including what looks like text inside them).

* **`ScriptBlock`**: Matches the opening `<script...>`, any content inside that is *not* a closing tag, and finally the `</script>`.
* **`HeadBlock`**: Does the same for `<head>`.

In [None]:
const blockTokens = `
    // Matches <script... > ... content ... </script>
    ScriptBlock { "<script" ![>]* ">" !(<)* "</script>" }

    // Matches <head> ... content ... </head>
    HeadBlock { "<head>" !(<)* "</head>" }
`;
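
To make the intended effect of the two block tokens concrete, here is a small sketch in plain TypeScript that removes `<head>` and `<script>` elements with regular expressions. It does not use Lezer; `stripIgnoredBlocks` and the sample string are hypothetical, and the lazy `[\s\S]*?` is a regex idiom rather than something taken from the token rules above.

```typescript
// Remove whole <head>...</head> and <script>...</script> elements.
// [\s\S]*? lazily matches any characters, including newlines and nested tags.
function stripIgnoredBlocks(html: string): string {
    return html
        .replace(/<head\b[^>]*>[\s\S]*?<\/head>/gi, "")
        .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "");
}

const sampleHtml: string =
    "<html><head><title>Hidden</title></head>" +
    "<body>Visible<script>var x = '<b>';</script> text</body></html>";

const strippedHtml: string = stripIgnoredBlocks(sampleHtml);
// strippedHtml is "<html><body>Visible text</body></html>"
```

The `\b` after `head` keeps the first pattern from matching `<header>`, which appears in the example page.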

### 5. Content and Fallback

Finally, we define what counts as actual text content.

* **`Content`**: Any sequence of characters that is *not* a start of a tag (`<`), an entity start (`&`), or a newline.
* **`any`**: A fallback for single characters that don't match anything else (safety net).

We then close the `@tokens` block.

In [None]:
const contentTokens = `
    Content { ![<&\\n\\r]+ }
    any { _ }
  }
`;

### 6. Building the Parser

Now we concatenate all parts to form the complete grammar string and compile it using `buildParser`.

In [None]:
const finalGrammar = 
    entryPoint + 
    basicTokens + 
    entityTokens + 
    blockTokens + 
    contentTokens;

const parser = buildParser(finalGrammar);
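
Once a parser exists, its syntax tree still has to be turned into text. The following sketch shows one possible way to do that. To stay self-contained it does not import Lezer: it declares a hypothetical `CursorLike` interface with the `name`, `from`, `to`, and `next()` members that a `TreeCursor` from `@lezer/common` provides, and it is exercised with a hand-written stand-in cursor. The node names follow the grammar above; `treeToText`, `fakeCursor`, and the small entity table are assumptions made for illustration.

```typescript
// Minimal structural view of a Lezer TreeCursor (assumption: only these members are needed).
interface CursorLike {
    readonly name: string;   // node type name, e.g. "Content"
    readonly from: number;   // start offset in the source
    readonly to: number;     // end offset in the source
    next(): boolean;         // advance in pre-order; false when no node is left
}

// Walk all nodes in document order and assemble the plain text.
function treeToText(cursor: CursorLike, source: string): string {
    const named = new Map<string, string>([["amp", "&"], ["lt", "<"], ["gt", ">"], ["uuml", "ü"]]);
    let text = "";
    do {
        const slice = source.slice(cursor.from, cursor.to);
        switch (cursor.name) {
            case "Content":
                text += slice;
                break;
            case "Linebreak":
                text += "\n";
                break;
            case "Entity":    // "&name;" -> look up name, keep unknown entities unchanged
                text += named.get(slice.slice(1, -1)) ?? slice;
                break;
            case "Unicode":   // "&#1234;" -> character with code point 1234
                text += String.fromCodePoint(parseInt(slice.slice(2, -1), 10));
                break;
            // Document, Tag, HeadBlock and ScriptBlock contribute no text.
        }
    } while (cursor.next());
    return text;
}

// Stand-in cursor over a hand-written node list, for demonstration only.
function fakeCursor(nodes: { name: string; from: number; to: number }[]): CursorLike {
    let i = 0;
    return {
        get name() { return nodes[i].name; },
        get from() { return nodes[i].from; },
        get to() { return nodes[i].to; },
        next() { if (i + 1 >= nodes.length) return false; i++; return true; }
    };
}

// Build a tiny source string and the node list a parser might produce for it.
const demoPieces: [string, string][] = [
    ["Tag", "<b>"], ["Content", "W"], ["Entity", "&uuml;"], ["Content", "rttemberg"],
    ["Tag", "</b>"], ["Content", " "], ["Unicode", "&#8555;"]
];
const demoSource: string = demoPieces.map(([, piece]) => piece).join("");
const demoNodes = [{ name: "Document", from: 0, to: demoSource.length }];
let demoOffset = 0;
for (const [name, piece] of demoPieces) {
    demoNodes.push({ name, from: demoOffset, to: demoOffset + piece.length });
    demoOffset += piece.length;
}

const demoText: string = treeToText(fakeCursor(demoNodes), demoSource);
// demoText is "Württemberg Ⅻ"
```

With the real parser the call would presumably look like `treeToText(parser.parse(data).cursor(), data)`. Note that Lezer only creates tree nodes for rules whose names start with a capital letter, so characters matched by the lowercase `any` rule would not show up in such a walk.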

## Helper Functions

To convert the entities back to readable text, we use these helper functions:

In [10]:
// Decode a named entity such as "&uuml;"; unknown entities are returned unchanged.
function decodeEntity(entity: string): string {
    const raw = entity.substring(1, entity.length - 1);   // strip "&" and ";"
    const map: Record<string, string> = {
        'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"', 'apos': "'",
        'uuml': 'ü', 'auml': 'ä', 'ouml': 'ö', 'szlig': 'ß'
    };
    return map[raw] || entity;
}

// Decode a decimal code point given as a string of digits, e.g. "8555".
function decodeUnicode(unicode: string): string {
    const code = parseInt(unicode, 10);
    return String.fromCodePoint(code);
}

In [15]:
decodeEntity("&auml;");

ä


In [17]:
decodeEntity("&euml;");

&euml;


In [13]:
decodeUnicode("8555");

Ⅻ


In [14]:
decodeUnicode("128034");

🐢


### The Definition of the Token `ANY` 

The `ANY` token is our "catch-all" for regular text content. It matches any run of characters that contains neither the start of an HTML tag (`<`), the start of an entity (`&`), nor a line break.

In [None]:
import { createToken, Lexer, tokenMatcher, TokenType, IToken, ILexingResult } from "chevrotain";

const ANY : TokenType = createToken({
  name: "ANY",
  pattern: /[^<&\r\n]+/
});

The pattern `/[^<&\r\n]+/` matches one or more characters that are not:

- `<` (which would start an HTML tag)
- `&` (which would start an HTML entity)
- `\r` or `\n` (which are handled by LINEBREAK)
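
To see what this pattern consumes in practice, the following sketch applies the same regular expression to one line of the example page using plain JavaScript regex matching, without Chevrotain. The variable names are illustrative.

```typescript
// Same pattern as the ANY token, applied directly to sample input.
const anyPattern: RegExp = /[^<&\r\n]+/;

const sampleLine: string = "Phone:&nbsp;&nbsp;&nbsp; +49 621 4105-1376<br/>";

// The first run of plain text ends at the first "&".
const firstRun: string | undefined = anyPattern.exec(sampleLine)?.[0];
// firstRun is "Phone:"

// Matching again after the last entity picks up the next run, which ends before "<".
const restStart: number = sampleLine.lastIndexOf(";") + 1;
const secondRun: string | undefined = anyPattern.exec(sampleLine.slice(restStart))?.[0];
// secondRun is " +49 621 4105-1376"
```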

**Important**: This token must be defined last among the `initial_mode` tokens. Chevrotain tries to match tokens in the order they appear in the mode definition, so more specific patterns (like `TAG`, `NAMED_ENTITY`) must come before this general pattern. Otherwise, `ANY` would greedily consume characters that should be matched by other tokens.

### The Definition of the Token `HEAD_END` 

The `HEAD_END` token marks the end of the HTML header section and triggers a return to normal text extraction mode.

In [None]:
const HEAD_END : TokenType = createToken({
  name: "HEAD_END",
  pattern: /<\/head>/i,
  pop_mode: true
});

The pattern `/<\/head>/i` matches:

- An opening angle bracket `<`
- A forward slash `\/` (escaped because `/` has special meaning in regex)
- The word "head"
- A closing angle bracket `>`
- The `i` flag makes it case-insensitive

The `pop_mode: true` property tells Chevrotain to return to the previous mode (which was `initial_mode` before we pushed to `header_mode`). This token is only active in `header_mode`, not in `initial_mode`, which is why it only matches the closing tag and does not conflict with other patterns.

### The Definition of the Token `SCRIPT_END`

Similar to `HEAD_END`, the `SCRIPT_END` token marks the end of embedded JavaScript code and returns the lexer to normal mode.

In [None]:
const SCRIPT_END : TokenType = createToken({
  name: "SCRIPT_END",
  pattern: /<\/script>/i,
  pop_mode: true
});

The pattern `/<\/script>/i` matches the closing script tag with case-insensitive matching. Like `HEAD_END`, the `pop_mode: true` property returns the lexer to `initial_mode` after this token is matched.

This token is only active in `script_mode`, ensuring that JavaScript code between `<script>` and `</script>` tags is completely ignored and not extracted as text content.

### The Definition of Content Tokens for Special Modes

When the lexer is in `header_mode` or `script_mode`, we need tokens that will consume (and discard) all content until the respective end tag is found.

In [None]:
const HeaderContent : TokenType = createToken({
  name: "HeaderContent",
  pattern: /(.|\n)+?(?=<\/head>)/i,
  line_breaks: true,
  group: Lexer.SKIPPED
});

const ScriptContent : TokenType = createToken({
  name: "ScriptContent",
  pattern: /(.|\n)+?(?=<\/script>)/i,
  line_breaks: true,
  group: Lexer.SKIPPED
});

These patterns use advanced regex features:

- `(.|\n)+?` matches any character (`.`) or newline (`\n`), one or more times, non-greedy (`+?`)
- `(?=<\/head>)` is a positive lookahead: it checks that the closing tag follows, but doesn't consume it
- `line_breaks: true` is essential because these patterns span multiple lines
- `group: Lexer.SKIPPED` ensures this content is discarded, not extracted

The non-greedy match (`+?`) combined with the lookahead ensures that these tokens stop just before the end tag, allowing `HEAD_END` or `SCRIPT_END` to match correctly. Without the lookahead, the pattern might consume the end tag itself, preventing the mode switch back to `initial_mode`.
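
The effect of the lazy quantifier and the lookahead can be checked directly with JavaScript's regex engine. The sketch below uses the `HeaderContent` pattern on a made-up header fragment, outside of Chevrotain; `headerSample` and the result variables are illustrative names.

```typescript
// The HeaderContent pattern from above, used on its own.
const headerContentPattern: RegExp = /(.|\n)+?(?=<\/head>)/i;

const headerSample: string =
    "<meta charset=\"utf-8\">\n<title>Demo</title>\n</head><body>Hi</body>";
const headerMatch: RegExpExecArray | null = headerContentPattern.exec(headerSample);

// Everything before the closing tag is consumed by the match ...
const consumedHeader: string = headerMatch ? headerMatch[0] : "";
// ... and the closing tag itself is left over for HEAD_END.
const afterHeader: string = headerSample.slice(consumedHeader.length);
// afterHeader is "</head><body>Hi</body>"
```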

## Running the Scanner

### Creating the Lexer

Now that all tokens are defined, we can create the actual Chevrotain lexer. The lexer is configured with multiple modes, each containing a specific set of active tokens.

In [None]:
const HtmlLexer : Lexer = new Lexer({
  defaultMode: "initial_mode",
  modes: {
    initial_mode: [
      HEAD_START,
      SCRIPT_START,
      LINEBREAK,
      TAG,
      NAMED_ENTITY,
      UNICODE,
      ANY
    ],
    header_mode: [
      HEAD_END,
      HeaderContent
    ],
    script_mode: [
      SCRIPT_END,
      ScriptContent
    ]
  }
});

The lexer configuration specifies:

- `defaultMode`: The mode the lexer starts in (`initial_mode`)
- `modes`: An object defining which tokens are active in each mode

**Token order matters!** Within each mode, tokens are tried in the order they appear. Specific patterns (like `NAMED_ENTITY`, `UNICODE`) must come before general ones (like `ANY`) to ensure correct matching.
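
The interplay of modes and token order can be imitated in a few lines of plain TypeScript. The sketch below is not Chevrotain and does not reproduce its API; `Rule`, `miniModes`, `miniLex`, and the reduced rule set are hypothetical and only illustrate a mode stack with push and pop together with first-match-wins ordering.

```typescript
interface Rule {
    name: string;
    pattern: RegExp;    // must carry the sticky flag "y" so it matches exactly at lastIndex
    push?: string;      // mode to enter after a match
    pop?: boolean;      // leave the current mode after a match
    skip?: boolean;     // discard the matched text
}

const miniModes: Record<string, Rule[]> = {
    initial: [
        { name: "HEAD_START", pattern: /<head>/iy, push: "header", skip: true },
        { name: "TAG", pattern: /<[^>]+>/y, skip: true },
        { name: "ANY", pattern: /[^<&\r\n]+/y }
    ],
    header: [
        { name: "HEAD_END", pattern: /<\/head>/iy, pop: true, skip: true },
        { name: "HeaderContent", pattern: /[\s\S]+?(?=<\/head>)/iy, skip: true }
    ]
};

function miniLex(input: string): string[] {
    const stack: string[] = ["initial"];
    const out: string[] = [];
    let pos = 0;
    while (pos < input.length) {
        const rules = miniModes[stack[stack.length - 1]];
        let matched = false;
        // Rules are tried in array order; the first one that matches wins.
        for (const rule of rules) {
            rule.pattern.lastIndex = pos;
            const m = rule.pattern.exec(input);
            if (m === null) continue;
            if (!rule.skip) out.push(`${rule.name}:${m[0]}`);
            if (rule.push) stack.push(rule.push);
            if (rule.pop) stack.pop();
            pos += m[0].length;
            matched = true;
            break;
        }
        if (!matched) pos++;   // no rule applies: skip one character
    }
    return out;
}

const miniTokens: string[] = miniLex("<head><title>Hidden</title></head><p>Shown</p>");
// miniTokens is ["ANY:Shown"]
```

Moving `TAG` ahead of `HEAD_START` in this sketch would let the generic tag rule swallow `<head>`, so the header mode would never be entered and `Hidden` would end up in the output.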

### Processing Tokens

After tokenization, we need to process the tokens and reconstruct the plain text. The `processTokens` function iterates through all recognized tokens and builds the output string.

Instead of comparing token names as strings (which is prone to typos), we use Chevrotain's `tokenMatcher` utility. This ensures type safety and robustness, even if we rename our tokens later.

In [None]:
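import { decodeHTML } from "entities";   // assumption: decodeHTML comes from the `entities` npm package

function processTokens(tokens: IToken[]): string {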
  let result : string = "";
  
  for (const token of tokens) {
    if (tokenMatcher(token, LINEBREAK)) {
      result += "\n";
    } 
    else if (tokenMatcher(token, NAMED_ENTITY)) {
      const entityText: string = token.image;
      const cleanEntity: string = entityText.replace(/^&|;$/g, "");
      result += decodeHTML(`&${cleanEntity};`);
    } 
    else if (tokenMatcher(token, UNICODE)) {
      const unicodeText: string = token.image;
      const cleanNumber: string = unicodeText.replace(/^&#|;$/g, "");
      result += String.fromCodePoint(parseInt(cleanNumber, 10));
    } 
    else if (tokenMatcher(token, ANY)) {
      result += token.image;
    }
  }
  
  return result;
}

Each token type is handled differently:

- **`LINEBREAK`**: Outputs a single newline character. Since our lexer pattern already consumed sequences of whitespace and newlines, this effectively condenses them into one.
- **`NAMED_ENTITY`**: Extracts the entity name (removing the leading `&` and optional trailing `;`) and converts it to a character using `decodeHTML`.
- **`UNICODE`**: Extracts the numeric code (removing `&#` and optional `;`) and converts it to a character using `String.fromCodePoint`.
- **`ANY`**: Outputs the matched text exactly as it appeared in the source.

Note that tokens like `HEAD_START`, `SCRIPT_START`, `HEAD_END`, and `SCRIPT_END` are not handled here because they serve only as control signals for mode switching and do not produce content. Similarly, the `TAG` token is missing because it was marked as `SKIPPED` in the lexer definition.
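
The numeric branch can be tried in isolation. The following sketch repeats the string handling from `processTokens` on literal inputs; `decodeNumericReference` is a hypothetical helper name, and only the standard `parseInt` and `String.fromCodePoint` functions are used.

```typescript
// Strip the "&#" prefix and an optional trailing ";" and convert the decimal code point.
function decodeNumericReference(image: string): string {
    const cleanNumber: string = image.replace(/^&#|;$/g, "");
    return String.fromCodePoint(parseInt(cleanNumber, 10));
}

const romanTwelve: string = decodeNumericReference("&#8555;");   // "Ⅻ" (U+216B)
const rightArrow: string = decodeNumericReference("&#8594");     // "→" (U+2192), semicolon omitted
```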

### Tokenizing and Extracting Text

Finally, we feed our HTML data into the lexer and extract the plain text. The `tokenize` method returns an `ILexingResult` object containing:

- `tokens`: An array of successfully recognized tokens
- `errors`: An array of any lexing errors encountered

Error checking is included for robustness. Lexing errors should be rare, but they can still occur when the input contains a character that no token pattern accepts, for example a bare `&` that does not start an entity, because `ANY` excludes both `<` and `&`. The extracted text is then printed to the console, showing the HTML document stripped of all tags and with entities converted to Unicode characters.

In [None]:
const lexingResult: ILexingResult = HtmlLexer.tokenize(data);

if (lexingResult.errors.length > 0) {
  console.error("Lexing errors detected:");

  for (const error of lexingResult.errors) {
    const charFromMessage: string | undefined = error.message.match(/->(.)<-/)?.[1];
    const illegalChar: string = charFromMessage || data.substring(error.offset, error.offset + error.length) || "?";

    console.error(`  - Illegal character '${illegalChar}' at line ${error.line}.`);
    console.error(`    This is the ${error.offset}th character.`);
  }
}

const extractedText = processTokens(lexingResult.tokens as IToken[]);
console.log(extractedText);

### Output

The result is clean, readable text extracted from the HTML source. All tags have been removed, HTML entities like `&uuml;` have been converted to their Unicode equivalents (ü), and numeric entities like `&#8555;` have been converted to their characters (Ⅻ).

### Inspecting Individual Tokens

For debugging or educational purposes, you can inspect each token individually to see how the lexer processed the input:

In [None]:
for (const tok of lexingResult.tokens as IToken[]) {
  console.log({
    name: tok.tokenType.name,
    image: tok.image,
    startLine: tok.startLine,
    startColumn: tok.startColumn
  });
}

Each token object contains:

- `tokenType.name`: The type of token (e.g., "`LINEBREAK`", "`ANY`")

- `image`: The actual matched text from the source

- `startLine` and `startColumn`: Position information for debugging

This allows you to see exactly how Chevrotain broke down the HTML into individual tokens before processing.