In [1]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Converting HTML to Text

This notebook demonstrates how to build a simple but effective HTML-to-text converter using **TypeScript** and the powerful parsing library [`chevrotain`](https://chevrotain.io/).

## The Goal

Our objective is to extract the readable, plain text content from a given HTML document. We will use the HTML source code from the homepage of [Prof. Dr. Karl Stroetmann](http://wwwlehre.dhbw-stuttgart.de/~stroetma/) as our example data. 

To achieve this, we implement a small **state machine** in TypeScript that distinguishes between different HTML sections such as `<head>`, `<script>`, and normal text.


## The Strategy: A State Machine

HTML is not a simple, linear format. Some sections, like the `<head>` and `<script>` blocks, should be completely ignored, while the content in the `<body>` should be processed.

To handle this, we will implement a simple state machine using Chevrotain's "Lexer Modes" feature. Our lexer will switch between different states (or modes) depending on the context:

- `initial_mode`: The default state for processing normal text content.
- `header_mode`: An "ignore" state activated when entering a `<head>` tag.
- `script_mode`: An "ignore" state activated when entering a `<script>` tag.

This approach allows us to create a robust tokenizer that correctly distinguishes between content to be extracted and content to be discarded.
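
The mode idea can be sketched without any library. The following is a hand-rolled illustration, not Chevrotain's implementation: the names `Mode` and `keepVisibleText` are hypothetical, and the function only removes the ignored sections while leaving all other markup in place.

```typescript
// Hypothetical mini state machine that mirrors the three modes listed above.
type Mode = "initial_mode" | "header_mode" | "script_mode";

function keepVisibleText(html: string): string {
  const modeStack: Mode[] = ["initial_mode"];
  let out = "";
  let pos = 0;

  while (pos < html.length) {
    const mode = modeStack[modeStack.length - 1];
    const rest = html.slice(pos);

    if (mode === "initial_mode") {
      const opening = /^<(head|script)\b[^>]*>/i.exec(rest);
      if (opening) {
        // Entering an ignored section: push the matching mode.
        const next: Mode =
          opening[1].toLowerCase() === "head" ? "header_mode" : "script_mode";
        modeStack.push(next);
        pos += opening[0].length;
      } else {
        out += html[pos]; // ordinary content is kept
        pos += 1;
      }
    } else {
      const endTag = mode === "header_mode" ? /^<\/head>/i : /^<\/script>/i;
      const closing = endTag.exec(rest);
      if (closing) {
        modeStack.pop(); // return to the previous mode
        pos += closing[0].length;
      } else {
        pos += 1; // discard one character of ignored content
      }
    }
  }
  return out;
}

console.log(keepVisibleText("<head><title>T</title></head><p>Hi</p><script>x=1</script>!"));
// prints "<p>Hi</p>!"
```

Pushing and popping on a stack, rather than assigning a single state variable, is what lets the closing tag return to whichever mode was active before.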

In [2]:
const data = `
<html>
  <head>
    <meta charset="utf-8">
    <title>Homepage of Prof. Dr. Karl Stroetmann</title>
    <link type="text/css" rel="stylesheet" href="style.css" />
    <link href="http://fonts.googleapis.com/css?family=Rochester&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Pacifico&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Cabin+Sketch&subset=latin,latin-ext" rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Sacramento" rel="stylesheet" type="text/css">
  </head>
  <body>
    <hr/>

    <div id="table">
      <header>
        <h1 id="name">Prof. Dr. Karl Stroetmann</h1>
      </header>

      <div id="row1">
        <div class="right">
          <a id="dhbw" href="http://www.ba-stuttgart.de">Duale Hochschule Baden-W&uuml;rttemberg</a>
          <br/>Coblitzallee 1-9
          <br/>68163 Mannheim
          <br/>Germany
	  <br>
          <br/>Office: &nbsp;&nbsp;&nbsp; Raum 344B
          <br/>Phone:&nbsp;&nbsp;&nbsp; +49 621 4105-1376
          <br/>Fax:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; +49 621 4105-1194
          <br/>Skype: &nbsp;&nbsp;&nbsp; karlstroetmann
        </div>  


        <div id="links">
          <strong class="some">Some links:</strong>
          <ul class="inlink">
            <li class="inlink">
	      My <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">lecture notes</a>,
              as well as the programs presented in class, can be found
              at <br>
              <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">https://github.com/karlstroetmann</a>.
              
            </li>
            <li class="inlink">Most of my papers can be found at <a class="inlink" href="https://www.researchgate.net/">researchgate.net</a>.</li>
            <li class="inlink">The programming language SetlX can be downloaded at <br>
              <a href="http://randoom.org/Software/SetlX"><tt class="inlink">http://randoom.org/Software/SetlX</tt></a>.
            </li>
          </ul>
        </div>
      </div>
    </div>
    
    <div id="intro">
      As I am getting old and wise, I have to accept the limits of
      my own capabilities.  I have condensed these deep philosophical
      insights into a most beautiful pearl of poetry.  I would like 
      to share these humble words of wisdom:
      
      <div class="poetry">
        I am a teacher by profession,    <br>
        mostly really by obsession;      <br>
        But even though I boldly try,    <br>
        I just cannot teach <a href="flying-pig.jpg" id="fp">pigs</a> to fly.</br>
        Instead, I slaughter them and fry.
      </div>
      
      <div class="citation">
        <div class="quote">
          Any sufficiently advanced poetry is indistinguishable from divine wisdom.
        </div>
        <div id="sign">His holiness Pope Hugo &#8555;.</div>
      </div>
    </div>
</div>

</body>
</html>
`;

In [3]:
display.html(data)

The original web page is still available at https://wwwlehre.dhbw-stuttgart.de/~stroetma/.

## Imports

Before we can build our HTML lexer, we need to install and import the necessary packages.

We use two packages for this project:

- `chevrotain`: A powerful parsing toolkit that provides the lexer functionality we need
- `entities`: A utility library for decoding HTML entities (like `&uuml;` → ü)

In [4]:
const { execSync } = await import('child_process');
console.log(execSync('npm install chevrotain@10').toString());
console.log(execSync('npm install entities').toString());




up to date, audited 10 packages in 1s

1 package is looking for funding
  run `npm fund` for details

found 0 vulnerabilities






up to date, audited 10 packages in 1s

1 package is looking for funding
  run `npm fund` for details

found 0 vulnerabilities



From Chevrotain, we import the core components needed for building our lexer:

In [5]:
import { createToken, Lexer, IToken } from "chevrotain";
import { decodeHTML } from "entities";

- `createToken`: Function to define individual token types
- `Lexer`: The lexer class that will tokenize our HTML input
- `IToken`: TypeScript interface for token objects
- `decodeHTML`: Function to convert HTML entities to Unicode characters

With these imports in place, we're ready to define our tokens and build the lexer.

## Token Definitions

In this section, we define the tokens needed to process HTML content and extract plain text. In Chevrotain, token definitions are declarative and separate from the processing logic.

Each token is created using Chevrotain's `createToken` function, which takes a configuration object with key properties:

- `name`: A string identifier for the token type
- `pattern`: A regular expression defining what strings this token matches
- `push_mode`: Switches the lexer to a different mode when this token is matched
- `pop_mode`: Returns the lexer to the previous mode
- `group`: Controls whether tokens appear in the output (e.g. `Lexer.SKIPPED`)
- `line_breaks`: Indicates if the pattern can contain newlines

### The Definition of the Token `HEAD_START`

When the scanner encounters the opening tag `<head>`, it needs to switch into a special mode where all content is ignored until the closing `</head>` tag appears.

In Chevrotain, mode switching is handled declaratively using the `push_mode` property. This token pushes the lexer into `header_mode`, where a different set of token rules becomes active. The token is only recognized in the default `initial_mode` state.

In [6]:
const HEAD_START = createToken({
  name: "HEAD_START",
  pattern: /<head>/i,
  push_mode: "header_mode"
});

The pattern `/<head>/i` uses the `i` flag for case-insensitive matching, so it will match `<head>`, `<HEAD>`, or any other case variation.
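
As a quick sanity check, the pattern can be tried on a few made-up inputs outside the lexer. The sample strings and the names `headStartPattern` and `headChecks` are illustrative assumptions only.

```typescript
// Same regular expression as HEAD_START, tested in isolation.
const headStartPattern = /<head>/i;

const headChecks = {
  lower: headStartPattern.test("<head>"),
  upper: headStartPattern.test("<HEAD>"),
  mixed: headStartPattern.test("<Head>"),
  // An opening tag that carries attributes is not covered by this pattern.
  withAttribute: headStartPattern.test('<head lang="en">'),
};

console.log(headChecks);
// prints { lower: true, upper: true, mixed: true, withAttribute: false }
```

The last case suggests that a `<head>` tag with attributes would not switch the lexer into `header_mode`; the example page used here has no such tag.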

### The Definition of the Token `SCRIPT_START`

Similar to `HEAD_START`, when the scanner reads an opening `<script>` tag, it switches into `script_mode`. In this mode, all content is discarded until the closing `</script>` tag is found.

In [7]:
const SCRIPT_START = createToken({
  name: "SCRIPT_START",
  pattern: /<script\b[^>]*>/i,
  push_mode: "script_mode"
});

The pattern `/<script\b[^>]*>/i` is more sophisticated than a simple `<script>` match:

- `\b` ensures a word boundary after "script" (preventing matches like `<scripting>`)
- `[^>]*` matches any attributes that might follow (e.g., `<script type="text/javascript">`)
- The `i` flag makes it case-insensitive

This token transitions the lexer to `script_mode`, where JavaScript code embedded in the HTML will be ignored rather than extracted as text.
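
The three points above can be checked directly against the regular expression. The sample tags, including the file name `app.js`, are made up for illustration.

```typescript
// Same regular expression as SCRIPT_START, tested in isolation.
const scriptStartPattern = /<script\b[^>]*>/i;

const scriptSamples = [
  "<script>",
  '<SCRIPT type="text/javascript">',
  '<script\n  src="app.js">', // attributes may continue on the next line
  "<scripting>",              // rejected by the word boundary \b
];

const scriptResults = scriptSamples.map((s) => scriptStartPattern.test(s));
console.log(scriptResults);
// prints [ true, true, true, false ]
```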

### The Definition of the Token `LINEBREAK`

The `LINEBREAK` token handles whitespace and newline characters in the HTML document. Instead of preserving every single newline and space, we condense multiple consecutive whitespace-newline sequences into a single newline character.

In [8]:
const LINEBREAK = createToken({
  name: "LINEBREAK",
  pattern: /(\s*\n\s*)+/,
  line_breaks: true
});

The pattern `/(\s*\n\s*)+/` matches one or more sequences of:

- Optional whitespace (`\s*`)
- A newline character (`\n`)
- Optional whitespace again (`\s*`)

The `line_breaks: true` property is crucial - it tells Chevrotain that this token can contain newline characters, allowing the lexer to correctly track line and column positions in the source document. This token is only active in `initial_mode`.
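
The condensing effect can be reproduced with a plain `String.replace`, which is only a stand-in for what the lexer and the later processing step do together. The names `linebreakPattern`, `messy`, and `condensed` are hypothetical.

```typescript
// LINEBREAK pattern with the g flag added so that replace() handles every run.
const linebreakPattern = /(\s*\n\s*)+/g;

const messy = "first line   \n\n\t  \n   second line";
const condensed = messy.replace(linebreakPattern, "\n");

console.log(JSON.stringify(condensed));
// prints "first line\nsecond line"

// Whitespace without any newline is not touched by this pattern.
const untouched = "a   b".replace(linebreakPattern, "\n");
console.log(JSON.stringify(untouched));
// prints "a   b"
```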

### The Definition of the Token `TAG`

The `TAG` token matches any generic HTML tag that isn't specifically handled by other tokens (like `HEAD_START` or `SCRIPT_START`).

In [9]:
const TAG = createToken({
  name: "TAG",
  pattern: /<[^>]+>/,
  group: Lexer.SKIPPED
});

The pattern `/<[^>]+>/` matches:

- An opening angle bracket `<`
- One or more characters that are not a closing angle bracket (`[^>]+`)
- A closing angle bracket `>`

This catches tags like `<div>`, `</p>`, `<br/>`, `<a href="...">`, etc. The `group: Lexer.SKIPPED` property is important - it tells Chevrotain to recognize these tags but immediately discard them from the token stream. This means they won't appear in our final output, which is exactly what we want when extracting plain text from HTML.
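
To see the pattern at work outside the lexer, it can be applied with `String.replace`. This is a rough stand-in for skipping the token, and `stripTags` is a hypothetical helper. The second call shows a limitation of such a simple pattern: a `>` inside an attribute value ends the match early.

```typescript
// TAG pattern with the g flag added so that replace() removes every tag.
const tagPattern = /<[^>]+>/g;

const stripTags = (html: string): string => html.replace(tagPattern, "");

const simpleCase = stripTags('<a href="x.html">link</a><br/>');
console.log(simpleCase);
// prints link

// A ">" inside a quoted attribute value is mistaken for the end of the tag.
const trickyCase = stripTags('<a title="a>b">x</a>');
console.log(trickyCase);
// prints b">x
```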

### The Definition of the Token `NAMED_ENTITY`

<span style="font-variant:small-caps;">Html</span> uses named entities to represent special characters, like `&auml;` for "ä" or `&nbsp;` for a non-breaking space. The `NAMED_ENTITY` token recognizes these patterns.

In [10]:
const NAMED_ENTITY = createToken({
  name: "NAMED_ENTITY",
  pattern: /&[A-Za-z]+;?/
});

The pattern `/&[A-Za-z]+;?/` matches:

- An ampersand `&`
- One or more letters (`[A-Za-z]+`)
- An optional semicolon (`;?`)

The semicolon is optional because some HTML documents omit it, though it's technically required by the HTML5 standard. Examples this matches:

- `&auml;` → ä
- `&uuml;` → ü
- `&nbsp;` → non-breaking space

Later, in our token processing function, we'll use the `decodeHTML` function from the `entities` package to convert these named entities into their corresponding Unicode characters:

In [11]:
decodeHTML("&auml;"); // ä

ä
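
It can help to check what the `NAMED_ENTITY` pattern actually captures before decoding. The sketch below uses only the regular expression; `firstEntity` and the sample strings are hypothetical. It also shows two edge cases: entity names that contain digits, such as `&frac12;`, are cut off after the letters, and a bare `&` is not matched at all.

```typescript
// Same regular expression as NAMED_ENTITY, tested in isolation.
const namedEntityPattern = /&[A-Za-z]+;?/;

function firstEntity(text: string): string | null {
  const m = namedEntityPattern.exec(text);
  return m ? m[0] : null;
}

console.log(firstEntity("W&uuml;rttemberg")); // prints &uuml;
console.log(firstEntity("fish &amp chips"));  // prints &amp
console.log(firstEntity("&frac12; cup"));     // prints &frac
console.log(firstEntity("R & D"));            // prints null
```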


### The Definition of the Token `UNICODE` 

Besides named entities, <span style="font-variant:small-caps;">Html</span> also supports numeric Unicode entities that specify characters by their code point. These come in two forms: decimal (like `&#8555;`) and hexadecimal (though we only handle decimal here).

In [12]:
const UNICODE = createToken({
  name: "UNICODE",
  pattern: /&#[0-9]+;?/
});

The pattern `/&#[0-9]+;?/` matches:

- An ampersand `&`
- A hash symbol `#`
- One or more digits (`[0-9]+`)
- An optional semicolon (`;?`)

Examples this matches:

- `&#8555;` → Ⅻ (Roman numeral twelve)
- `&#128034;` → 🐢 (turtle emoji)
- `&#228;` → ä (same as `&auml;`)

Like `NAMED_ENTITY`, this token is listed before the `ANY` token in the mode definition. Note that the `ANY` pattern excludes `&`, so `ANY` can never consume the start of an entity; an `&` that matches neither `NAMED_ENTITY` nor `UNICODE` is matched by no rule at all. In our token processing function, we'll use `String.fromCodePoint()` to convert the numeric code into its corresponding Unicode character.

In [13]:
String.fromCodePoint(8555)

Ⅻ


In [14]:
String.fromCodePoint(128034)

🐢
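
The notebook handles decimal references only. As a hedged sketch of how hexadecimal ones could be covered as well, the hypothetical helper `decodeNumericEntity` below (not part of the notebook or of any library) accepts both forms and returns `null` for anything it cannot decode, including code points above U+10FFFF, for which `String.fromCodePoint` throws a `RangeError`.

```typescript
// Hypothetical helper: decodes "&#8555;" (decimal) and "&#x216B;" (hexadecimal).
function decodeNumericEntity(entity: string): string | null {
  const m = /^&#(?:([0-9]+)|[xX]([0-9A-Fa-f]+));?$/.exec(entity);
  if (!m) return null;

  const codePoint =
    m[1] !== undefined ? parseInt(m[1], 10) : parseInt(m[2], 16);

  // Guard against values outside the Unicode range instead of throwing.
  return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : null;
}

console.log(decodeNumericEntity("&#8555;"));     // prints Ⅻ
console.log(decodeNumericEntity("&#x216B;"));    // prints Ⅻ
console.log(decodeNumericEntity("&#99999999;")); // prints null
```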


### The Definition of the Token `ANY` 

The `ANY` token is our "catch-all" for regular text content. It matches any sequence of characters that don't start an HTML tag or entity.

In [15]:
const ANY = createToken({
  name: "ANY",
  pattern: /[^<&\r\n]+/
});

The pattern `/[^<&\r\n]+/` matches one or more characters that are not:

- `<` (which would start an HTML tag)
- `&` (which would start an HTML entity)
- `\r` or `\n` (which are handled by LINEBREAK)

**Important**: This token is listed last among the `initial_mode` tokens. Chevrotain tries the tokens of a mode in the order they appear in the mode definition, and the first pattern that matches wins. Placing `ANY` last means that the more specific rules get the first chance; for example, blanks that directly precede a newline are taken by `LINEBREAK` rather than by `ANY`. Because the `ANY` pattern excludes `<` and `&`, it can never consume the start of a tag or an entity, whatever the order.
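
A lexer always matches at the current offset, which can be imitated with the sticky flag `y`. The sketch below is an assumption-labelled illustration, not Chevrotain code: `anyAt` and the sample string are hypothetical.

```typescript
// ANY pattern with the sticky flag: a match must begin exactly at lastIndex.
const anyPattern = /[^<&\r\n]+/y;

function anyAt(text: string, offset: number): string | null {
  anyPattern.lastIndex = offset;
  const m = anyPattern.exec(text);
  return m ? m[0] : null;
}

const officeLine = "Office: &nbsp; Raum 344B<br/>";

console.log(JSON.stringify(anyAt(officeLine, 0)));  // prints "Office: "
console.log(JSON.stringify(anyAt(officeLine, 8)));  // prints null (offset 8 is "&")
console.log(JSON.stringify(anyAt(officeLine, 14))); // prints " Raum 344B"
```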

### The Definition of the Token `HEAD_END` 

The `HEAD_END` token marks the end of the HTML header section and triggers a return to normal text extraction mode.

In [16]:
const HEAD_END = createToken({
  name: "HEAD_END",
  pattern: /<\/head>/i,
  pop_mode: true
});

The pattern `/<\/head>/i` matches:

- An opening angle bracket `<`
- A forward slash `\/` (escaped because `/` delimits a regex literal in JavaScript)
- The word "head"
- A closing angle bracket `>`
- The `i` flag makes it case-insensitive

The `pop_mode: true` property tells Chevrotain to return to the previous mode (which was `initial_mode` before we pushed to `header_mode`). This token is only active in `header_mode`, not in `initial_mode`, so it cannot conflict with the patterns used for normal text.

### The Definition of the Token `SCRIPT_END`

Similar to `HEAD_END`, the `SCRIPT_END` token marks the end of embedded JavaScript code and returns the lexer to normal mode.

In [17]:
const SCRIPT_END = createToken({
  name: "SCRIPT_END",
  pattern: /<\/script>/i,
  pop_mode: true
});

The pattern `/<\/script>/i` matches the closing script tag with case-insensitive matching. Like `HEAD_END`, the `pop_mode: true` property returns the lexer to `initial_mode` after this token is matched.

This token is only active in `script_mode`, ensuring that JavaScript code between `<script>` and `</script>` tags is completely ignored and not extracted as text content.

### The Definition of Content Tokens for Special Modes

When the lexer is in `header_mode` or `script_mode`, we need tokens that will consume (and discard) all content until the respective end tag is found.

In [18]:
const HeaderContent = createToken({
  name: "HeaderContent",
  pattern: /(.|\n)+?(?=<\/head>)/i,
  line_breaks: true,
  group: Lexer.SKIPPED
});

const ScriptContent = createToken({
  name: "ScriptContent",
  pattern: /(.|\n)+?(?=<\/script>)/i,
  line_breaks: true,
  group: Lexer.SKIPPED
});

These patterns use advanced regex features:

- `(.|\n)+?` matches any character except a line terminator (`.`) or a newline (`\n`), one or more times, as few times as possible (`+?`)
- `(?=<\/head>)` is a positive lookahead—it checks that the closing tag follows, but doesn't consume it
- `line_breaks: true` is essential because these patterns span multiple lines
- `group: Lexer.SKIPPED` ensures this content is discarded, not extracted

The non-greedy match (`+?`) combined with the lookahead makes each of these tokens consume everything up to, but not including, the next end tag, so that `HEAD_END` or `SCRIPT_END` can match right afterwards. A greedy pattern without the lookahead could run past the end tag and prevent the switch back to `initial_mode`. If the end tag is missing altogether, the lookahead never succeeds and the content token does not match.
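
The interplay of the lazy quantifier and the lookahead can be observed on a small made-up header fragment. The variable names are hypothetical. The last two lines compare `(.|\n)` with the common alternative `[\s\S]` on input with Windows line endings, since `.` does not match `\r`; this alternative is a suggestion, not what the notebook uses.

```typescript
// Same regular expression as HeaderContent, tested in isolation.
const headerContentPattern = /(.|\n)+?(?=<\/head>)/i;

const headSection =
  '\n  <meta charset="utf-8">\n  <title>Demo</title>\n</head><body>text</body>';

const skipped = headerContentPattern.exec(headSection);
const skippedText = skipped ? skipped[0] : "";

// The match stops right before "</head>" and does not include it.
console.log(skippedText === headSection.slice(0, headSection.indexOf("</head>")));
// prints true

// Without a closing tag the lookahead can never succeed.
const unterminated = headerContentPattern.exec("<title>no end");
console.log(unterminated);
// prints null

// Anchored variants on CRLF input: "." cannot step over "\r".
const crlfWithDot = /^(.|\n)+?(?=<\/head>)/i.test("a\r\n</head>");
const crlfWithClass = /^[\s\S]+?(?=<\/head>)/i.test("a\r\n</head>");
console.log(crlfWithDot, crlfWithClass);
// prints false true
```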

## Running the Scanner

### Creating the Lexer

Now that all tokens are defined, we can create the actual Chevrotain lexer. The lexer is configured with multiple modes, each containing a specific set of active tokens.

In [19]:
const HtmlLexer = new Lexer({
  defaultMode: "initial_mode",
  modes: {
    initial_mode: [
      HEAD_START,
      SCRIPT_START,
      LINEBREAK,
      TAG,
      NAMED_ENTITY,
      UNICODE,
      ANY
    ],
    header_mode: [
      HEAD_END,
      HeaderContent
    ],
    script_mode: [
      SCRIPT_END,
      ScriptContent
    ]
  }
});

The lexer configuration specifies:

- `defaultMode`: The mode the lexer starts in (`initial_mode`)
- `modes`: An object defining which tokens are active in each mode

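**Token order matters!** Within each mode, tokens are tried in the order they appear, and the first pattern that matches is used. Specific patterns must therefore come before general ones that match the same input: `HEAD_START` and `SCRIPT_START` come before `TAG`, which would otherwise match `<head>` and `<script>` itself, and `LINEBREAK` comes before `ANY`.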
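
The effect of the ordering can be imitated in a few lines of plain TypeScript. This is a simplified model of "first matching rule wins", not Chevrotain's implementation; `Rule`, `firstMatch`, `headStartRule`, and `tagRule` are hypothetical names.

```typescript
type Rule = { name: string; pattern: RegExp };

// Returns the name of the first rule whose pattern matches at the given offset.
function firstMatch(rules: Rule[], text: string, offset = 0): string | null {
  for (const { name, pattern } of rules) {
    // The sticky flag forces the match to begin exactly at the offset.
    const sticky = new RegExp(pattern.source, pattern.flags.replace("y", "") + "y");
    sticky.lastIndex = offset;
    if (sticky.test(text)) return name;
  }
  return null;
}

const headStartRule: Rule = { name: "HEAD_START", pattern: /<head>/i };
const tagRule: Rule = { name: "TAG", pattern: /<[^>]+>/ };

console.log(firstMatch([headStartRule, tagRule], "<head>")); // prints HEAD_START
console.log(firstMatch([tagRule, headStartRule], "<head>")); // prints TAG
```

With the order reversed, the generic rule shadows the specific one, so the mode switch attached to `HEAD_START` would never be triggered.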

### Processing Tokens

After tokenization, we need to process the tokens and reconstruct the plain text. The `processTokens` function iterates through all tokens and builds the output string based on token type.

In [20]:
function processTokens(tokens: IToken[]): string {
  let result = "";

  for (const token of tokens) {
    switch (token.tokenType.name) {
      case "LINEBREAK":
        // Collapse the whole whitespace run into a single newline.
        result += "\n";
        break;

      case "NAMED_ENTITY": {
        // Strip the leading "&" and the optional ";", then re-add the ";"
        // so that decodeHTML always receives a complete entity.
        const entityName = token.image.slice(1).replace(/;$/, "");
        result += decodeHTML(`&${entityName};`);
        break;
      }

      case "UNICODE": {
        // Strip the leading "&#" and the optional ";", then parse as decimal.
        const numberStr = token.image.slice(2).replace(/;$/, "");
        result += String.fromCodePoint(parseInt(numberStr, 10));
        break;
      }

      case "ANY":
        // Ordinary text is copied unchanged.
        result += token.image;
        break;

      // HEAD_START, HEAD_END, SCRIPT_START and SCRIPT_END produce no output.
    }
  }

  return result;
}

Each token type is handled differently:

- `LINEBREAK`: Outputs a single newline character, condensing multiple whitespace-newline sequences
- `NAMED_ENTITY`: Extracts the entity name (removing `&` and optional `;`) and converts it using `decodeHTML`
- `UNICODE`: Extracts the numeric code (removing `&#` and optional `;`) and converts it using `String.fromCodePoint`
- `ANY`: Outputs the matched text as-is

Tokens like `HEAD_START`, `SCRIPT_START`, `HEAD_END`, and `SCRIPT_END` don't produce output - they only control mode switching. The `TAG` token doesn't appear here because it's marked as `SKIPPED`.
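
The delimiter stripping is plain string slicing and can be checked in isolation. The helpers `stripNamed` and `stripNumeric` are hypothetical names that isolate the two slicing steps described above.

```typescript
// Removes the leading "&" and a trailing ";" if present.
const stripNamed = (image: string): string =>
  image.endsWith(";") ? image.slice(1, -1) : image.slice(1);

// Removes the leading "&#" and a trailing ";" if present.
const stripNumeric = (image: string): string =>
  image.endsWith(";") ? image.slice(2, -1) : image.slice(2);

console.log(stripNamed("&uuml;"));    // prints uuml
console.log(stripNamed("&nbsp"));     // prints nbsp
console.log(stripNumeric("&#8555;")); // prints 8555
console.log(stripNumeric("&#228"));   // prints 228
```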

### Tokenizing and Extracting Text

Finally, we feed our HTML data into the lexer and extract the plain text. The `tokenize` method returns a `lexingResult` object containing:

- `tokens`: An array of successfully recognized tokens
- `errors`: An array of any lexing errors encountered

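Error checking is included for robustness. The `ANY` token covers ordinary text, but it is not a complete catch-all: a bare `&` that does not start an entity, or a `<` that is never closed by `>`, matches no rule in `initial_mode` and is reported in `errors`. The example page contains no such input. The extracted text is then printed to the console, showing the HTML document stripped of all tags and with entities properly converted to Unicode characters.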

In [21]:
const lexingResult = HtmlLexer.tokenize(data);

if (lexingResult.errors.length > 0) {
  console.error("Lexing errors detected:");
  lexingResult.errors.forEach(error => {
    console.error(`  - ${error.message} at offset ${error.offset}`);
  });
}

const extractedText = processTokens(lexingResult.tokens);
console.log(extractedText);








Prof. Dr. Karl Stroetmann



Duale Hochschule Baden-Württemberg
Coblitzallee 1-9
68163 Mannheim
Germany

Office:     Raum 344B
Phone:    +49 621 4105-1376
Fax:        +49 621 4105-1194
Skype:     karlstroetmann


Some links:


My lecture notes,
as well as the programs presented in class, can be found
at 
https://github.com/karlstroetmann.

Most of my papers can be found at researchgate.net.
The programming language SetlX can be downloaded at 
http://randoom.org/Software/SetlX.






As I am getting old and wise, I have to accept the limits of
my own capabilities.  I have condensed these deep philosophical
insights into a most beautiful pearl of poetry.  I would like 
to share these humble words of wisdom:

I am a teacher by profession,    
mostly really by obsession;      
But even though I boldly try,    
I just cannot teach pigs to fly.
Instead, I slaughter them and fry.



Any sufficiently advanced poetry is indistinguishable from divine wisdom.

His holiness Pope Hugo Ⅻ.





### Output

The result is clean, readable text extracted from the HTML source. All tags have been removed, HTML entities like `&uuml;` have been converted to their Unicode equivalents (ü), and numeric entities like `&#8555;` have been converted to their characters (Ⅻ).
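
The error check in the tokenizing cell is worth keeping, because some inputs are matched by none of the `initial_mode` patterns. The sketch below copies the seven regular expressions from the token definitions and tests them at a given offset with the sticky flag. It is a plain TypeScript approximation of the lexer's matching step; `initialModePatterns` and `anyPatternMatchesAt` are hypothetical names.

```typescript
// The initial_mode patterns, in the order used by the lexer definition.
const initialModePatterns: RegExp[] = [
  /<head>/i,           // HEAD_START
  /<script\b[^>]*>/i,  // SCRIPT_START
  /(\s*\n\s*)+/,       // LINEBREAK
  /<[^>]+>/,           // TAG
  /&[A-Za-z]+;?/,      // NAMED_ENTITY
  /&#[0-9]+;?/,        // UNICODE
  /[^<&\r\n]+/,        // ANY
];

function anyPatternMatchesAt(text: string, offset: number): boolean {
  return initialModePatterns.some((p) => {
    const sticky = new RegExp(p.source, p.flags + "y");
    sticky.lastIndex = offset;
    return sticky.test(text);
  });
}

console.log(anyPatternMatchesAt("R &amp; D", 2)); // prints true  (entity)
console.log(anyPatternMatchesAt("R & D", 2));     // prints false (bare ampersand)
console.log(anyPatternMatchesAt("1 < 2", 2));     // prints false (unclosed "<")
```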

### Inspecting Individual Tokens

For debugging or educational purposes, you can inspect each token individually to see how the lexer processed the input:

In [22]:
for (const tok of lexingResult.tokens) {
  console.log({
    name: tok.tokenType.name,
    image: tok.image,
    startLine: tok.startLine,
    startColumn: tok.startColumn
  });
}


{ name: 'LINEBREAK', image: '\n', startLine: 1, startColumn: 1 }
{ name: 'LINEBREAK', image: '\n  ', startLine: 2, startColumn: 7 }
{ name: 'HEAD_START', image: '<head>', startLine: 3, startColumn: 3 }
{ name: 'HEAD_END', image: '</head>', startLine: 13, startColumn: 3 }
{ name: 'LINEBREAK', image: '\n  ', startLine: 13, startColumn: 10 }
{ name: 'LINEBREAK', image: '\n    ', startLine: 14, startColumn: 9 }
{
  name: 'LINEBREAK',
  image: '\n\n    ',
  startLine: 15,
  startColumn: 10
}
{
  name: 'LINEBREAK',
  image: '\n      ',
  startLine: 17,
  startColumn: 21
}
{
  name: 'LINEBREAK',
  image: '\n        ',
  startLine: 18,
  startColumn:
...
{
  name: 'LINEBREAK',
  image: '\n              \n            ',
  startLine: 43,
  startColumn: 129
}
{
  name: 'LINEBREAK',
  image: '\n            ',
  startLine: 45,
  startColumn: 18
}
{
  name: 'ANY',
  image: 'Most of my papers can be found at ',
  startLine: 46,
  startColumn: 32
}
{
  name: 'ANY',
  image: 'researchgate.net',
  startLine: 46,
  startColumn: 121
}
{ name: 'ANY', image: '.', startLine: 46, startColumn: 141 }
{
  name: 'LINEBREAK',
  image: '\n            ',
  startLine: 46,
  startColumn: 147
}
{
  name: 'ANY',
  image: 'The programming language SetlX can be downloaded at ',
  startLine: 47,
  startColumn: 32
}
{
  name: 'LINEBREAK',
  image: '\n              ',
  st
...

Each token object contains:

- `tokenType.name`: The type of token (e.g., `"LINEBREAK"`, `"ANY"`)
- `image`: The actual matched text from the source
- `startLine` and `startColumn`: Position information for debugging

This allows you to see exactly how Chevrotain broke down the HTML into individual tokens before processing.
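
For long documents, a per-type summary is often easier to read than the full token dump. The following sketch assumes only that each token has a `tokenType.name` field, as in the output above; `TokenLike`, `countByType`, and the sample tokens are hypothetical.

```typescript
// Minimal shape of a token as far as this summary is concerned.
type TokenLike = { tokenType: { name: string } };

// Counts how often each token type occurs.
function countByType(tokens: TokenLike[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const tok of tokens) {
    const name = tok.tokenType.name;
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

// Hand-written sample tokens standing in for a real lexing result.
const sampleTokens: TokenLike[] = [
  { tokenType: { name: "LINEBREAK" } },
  { tokenType: { name: "ANY" } },
  { tokenType: { name: "NAMED_ENTITY" } },
  { tokenType: { name: "ANY" } },
];

console.log(countByType(sampleTokens));
// prints { LINEBREAK: 1, ANY: 2, NAMED_ENTITY: 1 }
```

In the notebook, the same function could be called as `countByType(lexingResult.tokens)`.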