In [1]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Converting HTML to Text (TypeScript Version)

This notebook demonstrates how to use **TypeScript** and the [`chevrotain`](https://chevrotain.io/) library  
to extract plain text from an HTML document.  

The goal is to build a simple **lexer (tokenizer)** that recognizes HTML tags  
and outputs only the text content.  
For simplicity, we only support a small but representative subset of HTML.

The original web page was created by [Prof. Dr. Karl Stroetmann](http://wwwlehre.dhbw-stuttgart.de/~stroetma/).  
In this exercise, we aim to extract readable text from the HTML source of that page.  

To achieve this, we implement a small **state machine** in TypeScript that distinguishes  
between different HTML sections such as `<head>`, `<script>`, and normal text.

In [2]:
const data = `
<html>
  <head>
    <meta charset="utf-8">
    <title>Homepage of Prof. Dr. Karl Stroetmann</title>
    <link type="text/css" rel="stylesheet" href="style.css" />
    <link href="http://fonts.googleapis.com/css?family=Rochester&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Pacifico&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Cabin+Sketch&subset=latin,latin-ext" rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Sacramento" rel="stylesheet" type="text/css">
  </head>
  <body>
    <hr/>

    <div id="table">
      <header>
        <h1 id="name">Prof. Dr. Karl Stroetmann</h1>
      </header>

      <div id="row1">
        <div class="right">
          <a id="dhbw" href="http://www.ba-stuttgart.de">Duale Hochschule Baden-W&uuml;rttemberg</a>
          <br/>Coblitzallee 1-9
          <br/>68163 Mannheim
          <br/>Germany
	  <br>
          <br/>Office: &nbsp;&nbsp;&nbsp; Raum 344B
          <br/>Phone:&nbsp;&nbsp;&nbsp; +49 621 4105-1376
          <br/>Fax:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; +49 621 4105-1194
          <br/>Skype: &nbsp;&nbsp;&nbsp; karlstroetmann
        </div>  


        <div id="links">
          <strong class="some">Some links:</strong>
          <ul class="inlink">
            <li class="inlink">
	      My <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">lecture notes</a>,
              as well as the programs presented in class, can be found
              at <br>
              <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">https://github.com/karlstroetmann</a>.
              
            </li>
            <li class="inlink">Most of my papers can be found at <a class="inlink" href="https://www.researchgate.net/">researchgate.net</a>.</li>
            <li class="inlink">The programming language SetlX can be downloaded at <br>
              <a href="http://randoom.org/Software/SetlX"><tt class="inlink">http://randoom.org/Software/SetlX</tt></a>.
            </li>
          </ul>
        </div>
      </div>
    </div>
    
    <div id="intro">
      As I am getting old and wise, I have to accept the limits of
      my own capabilities.  I have condensed these deep philosophical
      insights into a most beautiful pearl of poetry.  I would like 
      to share these humble words of wisdom:
      
      <div class="poetry">
        I am a teacher by profession,    <br>
        mostly really by obsession;      <br>
        But even though I boldly try,    <br>
        I just cannot teach <a href="flying-pig.jpg" id="fp">pigs</a> to fly.</br>
        Instead, I slaughter them and fry.
      </div>
      
      <div class="citation">
        <div class="quote">
          Any sufficiently advanced poetry is indistinguishable from divine wisdom.
        </div>
        <div id="sign">His holiness Pope Hugo &#8555;.</div>
      </div>
    </div>
</div>

</body>
</html>
`;

In [3]:
display.html(data)

The original web page is still available at https://wwwlehre.dhbw-stuttgart.de/~stroetma/.

## Imports

We use the package [Chevrotain](https://chevrotain.io/documentation/0_7_2/index.html) to remove the 
<span style="font-variant:small-caps;">Html</span> tags and extract the text that
is embedded in the <span style="font-variant:small-caps;">Html</span> shown above.
Chevrotain is a toolkit for building lexers and parsers in TypeScript.
In this notebook, we only use its **lexer** functionality, i.e. the class `Lexer` and the function
`createToken`, to break the HTML document into tokens.
The package `entities` provides the function `decodeHTML`, which we use to decode named HTML entities.

In [4]:
const { execSync } = await import('child_process');
console.log(execSync('npm install chevrotain@10').toString());
console.log(execSync('npm install entities').toString());

npm notice
npm notice New major version of npm available! 10.8.2 -> 11.6.2
npm notice Changelog: https://github.com/npm/cli/releases/tag/v11.6.2
npm notice To update run: npm install -g npm@11.6.2
npm notice



up to date, audited 9 packages in 1s

1 package is looking for funding
  run `npm fund` for details

found 0 vulnerabilities






up to date, audited 9 packages in 976ms

1 package is looking for funding
  run `npm fund` for details

found 0 vulnerabilities



In [5]:
import { createToken, Lexer, ITokenConfig, TokenType, IToken } from "chevrotain";
import { decodeHTML } from "entities";

## Definition of the States

The lexer operates in multiple **modes (states)** that determine how HTML tokens are processed.

We define three main modes:

- `INITIAL` – the default mode for normal text and general HTML content  
- `header` – activated when the lexer is inside the `<head>` tag  
- `script` – activated when the lexer is inside a `<script>` block  

Each mode has its own set of token definitions to correctly distinguish  
between text, tags, and entities.
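
The relation between the three modes can be sketched as a plain TypeScript transition function. This is only an illustration: the names `Mode` and `nextMode` are hypothetical and are neither part of Chevrotain nor of the code in this notebook.

```typescript
// Hypothetical illustration: the three modes as a plain transition function.
type Mode = "INITIAL" | "header" | "script";

// Returns the mode that is active after the given tag has been read.
function nextMode(mode: Mode, tag: string): Mode {
  if (mode === "INITIAL" && /^<head>$/i.test(tag)) return "header";
  if (mode === "INITIAL" && /^<script\b[^>]*>$/i.test(tag)) return "script";
  if (mode === "header" && /^<\/head>$/i.test(tag)) return "INITIAL";
  if (mode === "script" && /^<\/script>$/i.test(tag)) return "INITIAL";
  return mode; // every other tag leaves the mode unchanged
}

const tags = ["<html>", "<head>", "<title>", "</head>", "<script src='a.js'>", "</script>"];
let mode: Mode = "INITIAL";
const trace: Mode[] = [];
for (const tag of tags) {
  mode = nextMode(mode, tag);
  trace.push(mode);
}
console.log(trace.join(" -> "));
// INITIAL -> header -> header -> INITIAL -> script -> INITIAL
```

Only the four opening and closing tags of `<head>` and `<script>` change the mode; all other input is handled inside the mode that is currently active.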

## Token Definitions

We proceed to define the tokens.  Note that the token definitions below do not print or transform
anything.  They only describe what to match; the transformation of the 
<span style="font-variant:small-caps;">Html</span> that has been matched is carried out later in the function `htmlToText`.

Each token is created using Chevrotain's `createToken` function, which takes a configuration object with two main properties:

- `name` – a string identifying the token type  
- `pattern` – a regular expression that defines which strings this token matches

### The Definition of the Token `HEAD_START`

The token `HEAD_START` matches the opening tag `<head>`.  Once this tag has been read, processing switches into the state `header`, in which characters are read and discarded until the closing tag `</head>` is encountered.  Note that the token definition below does not set Chevrotain's `push_mode` property, so the lexer itself does not change its mode; instead, the function `htmlToText` defined later in this notebook keeps track of the current state.  This token is only listed in the state `INITIAL`.  The state `INITIAL` is the initial state of the scanner, i.e. the scanner always starts in this state.

In [6]:
const HEAD_START = createToken({ 
    name: "HEAD_START",
    pattern: /<head>/i 
})

### The Definition of the Token `SCRIPT_START`

Once the opening tag `<script>` has been read, with or without attributes, processing switches into the state `script`.  In this state, characters are read and discarded until the closing tag `</script>` is seen.

In [7]:
const SCRIPT_START = createToken({ 
    name: "SCRIPT_START", 
    pattern: /<script\b[^>]*>/i 
});

### The Definition of the Token `LINEBREAK`

Groups of newline characters, together with the whitespace surrounding them, are condensed into a single newline character.
The option `line_breaks: true` tells Chevrotain that this token may contain line terminators, so that its line and column tracking stays accurate.
This token is active in the `INITIAL` state.

In [8]:
const LINEBREAK = createToken({
  name: "LINEBREAK",
  pattern: /(\s*\n\s*)+/,
  line_breaks: true
});
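
The effect of this pattern can be tried without Chevrotain. The following sketch applies the same regular expression with `String.replace`; the variable names and the sample string are made up for illustration.

```typescript
// Same pattern as LINEBREAK, with the global flag added for String.replace.
const lineBreaks = /(\s*\n\s*)+/g;

const sample = "first line   \n\n   \n   second line\nthird line";
const condensed = sample.replace(lineBreaks, "\n");
console.log(JSON.stringify(condensed));
// "first line\nsecond line\nthird line"
```

Whitespace that does not touch a newline, such as the blank inside `second line`, is left alone because the pattern requires at least one `\n`.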

### The Definition of the Token `TAG`

The token `TAG` is defined as any string that starts with the character `<` and ends with the character 
`>`. Between these two characters there has to be a nonzero number of characters that are different from 
the character `>`.  The text of the token is discarded, except that `htmlToText` emits a newline for `<br>` tags and an empty line for `<p>` tags.

In [9]:
const TAG = createToken({
  name: "TAG",
  pattern: /<[^>]+>/
});
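
As a quick check of this pattern outside of Chevrotain, the sketch below applies it to one line taken from the page shown above. The variable names are hypothetical.

```typescript
// Same pattern as TAG, with the global flag so that every tag is found.
const tagPattern = /<[^>]+>/g;

const snippet =
  '<li class="inlink">Most of my papers can be found at ' +
  '<a href="https://www.researchgate.net/">researchgate.net</a>.</li>';

const tagsFound = snippet.match(tagPattern) ?? [];
const withoutTags = snippet.replace(tagPattern, "");

console.log(tagsFound.length); // 4
console.log(withoutTags);      // Most of my papers can be found at researchgate.net.
```

Because of the `+` quantifier, the string `<>` is not recognized as a tag.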

### The Definition of the Token `NAMED_ENTITY`

In order to support named <span style="font-variant:small-caps;">Html</span> entities we use
the function `decodeHTML` from the package `entities`.  For every named 
<span style="font-variant:small-caps;">Html</span> entity `e`, `decodeHTML(e)` is the Unicode symbol that is specified by `e`.

In [10]:
console.log(decodeHTML("&auml;")); // ä

ä


The regular expression `&[a-zA-Z]+;?` searches for <span style="font-variant:small-caps;">Html</span>
entity names.  These are strings that start with the character `&` followed by the name of the entity, optionally followed by the character `;`.  For example, `&auml;` is the entity name that specifies the German umlaut `ä`.  If an entity name is found, the corresponding character is appended to the output text.

In [11]:
const NAMED_ENTITY = createToken({
  name: "NAMED_ENTITY",
  pattern: /&[A-Za-z]+;?/
});
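
The following sketch shows how matches of this pattern can be replaced by their characters. It deliberately uses a tiny hand-written lookup table, `entityTable`, which is a hypothetical stand-in so that the example runs without any package; the notebook itself relies on `decodeHTML` from `entities`.

```typescript
// Same pattern as NAMED_ENTITY, with the global flag for String.replace.
const namedEntity = /&[A-Za-z]+;?/g;

// Hypothetical mini table for illustration only; the notebook uses the
// function decodeHTML from the package "entities" instead.
const entityTable: Record<string, string> = {
  auml: "ä",
  uuml: "ü",
  nbsp: "\u00A0",
};

function replaceNamedEntities(input: string): string {
  return input.replace(namedEntity, (match: string) => {
    const name = match.replace(/^&|;$/g, ""); // strip "&" and the optional ";"
    return entityTable[name] ?? match;        // unknown names stay untouched
  });
}

console.log(replaceNamedEntities("Duale Hochschule Baden-W&uuml;rttemberg"));
// Duale Hochschule Baden-Württemberg
```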

### The Definition of the Token `UNICODE` 

The regular expression `&#[0-9]+;?` searches for <span style="font-variant:small-caps;">Html</span> entities that specify a Unicode character numerically.  The corresponding strings start with the character `&`
followed by the character `#` followed by digits and are optionally ended by the character `;`.

Note that the character `#` has no special meaning in a JavaScript regular expression and therefore does not have to be escaped.

Note further that the function `String.fromCodePoint` takes a number and returns the corresponding Unicode character.
For example, `String.fromCodePoint(128034)` returns the character `'🐢'`. 

In [12]:
const UNICODE = createToken({
  name: "UNICODE",
  pattern: /&#[0-9]+;?/
});

In [13]:
String.fromCodePoint(8555)

Ⅻ


In [14]:
String.fromCodePoint(128034)

🐢
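
Putting the pattern and `String.fromCodePoint` together, a decimal numeric entity can be decoded as sketched below. The function name `decodeNumericEntity` is hypothetical. The last line shows why `String.fromCharCode` is not a suitable replacement for code points above U+FFFF.

```typescript
// Decode a decimal numeric entity such as "&#8555;" into its character.
function decodeNumericEntity(entity: string): string {
  const digits = entity.replace(/[&#;]/g, ""); // keep only the digits
  return String.fromCodePoint(parseInt(digits, 10));
}

console.log(decodeNumericEntity("&#8555;"));  // Ⅻ
console.log(decodeNumericEntity("&#128034")); // 🐢 (the trailing ";" is optional)

// String.fromCharCode keeps only the lower 16 bits of its argument,
// so it cannot produce characters above U+FFFF such as the turtle.
console.log(String.fromCharCode(128034) === "🐢"); // false
```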


### The Definition of the Token `ANY` 

The regular expression `[^<&\r\n]+` matches any nonempty sequence of characters that contains neither `<`, nor `&`, nor a line terminator.  These characters are copied to the output unmodified.  Note that Chevrotain tries the token definitions of a mode in the order in which they are listed in the lexer definition, and the first definition that matches wins.  Therefore, `ANY` is listed after all other tokens of the `INITIAL` state, and `HEAD_START` and `SCRIPT_START` are listed before `TAG`, which would otherwise match `<head>` and `<script>` as well.  The `INITIAL` state is the default state of the scanner and therefore the state the scanner is in when it starts scanning.

In [15]:
const ANY = createToken({
  name: "ANY",
  pattern: /[^<&\r\n]+/
});
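
To illustrate why the order of the token definitions matters, here is a minimal hand-written tokenizer that tries its rules in array order and takes the first match. It is a sketch of the first-match principle only, not of Chevrotain's implementation; the type `Rule` and the function `firstMatch` are hypothetical.

```typescript
// Hypothetical first-match tokenizer: rules are tried in array order.
type Rule = { name: string; pattern: RegExp };

function firstMatch(rules: Rule[], input: string): string[] {
  const names: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const rule of rules) {
      // The sticky flag anchors the match at the current position.
      const re = new RegExp(rule.pattern.source, "y");
      re.lastIndex = pos;
      const m = re.exec(input);
      if (m !== null && m[0].length > 0) {
        names.push(rule.name);
        pos += m[0].length;
        matched = true;
        break; // the first matching rule wins
      }
    }
    if (!matched) pos++; // skip a character that no rule can match
  }
  return names;
}

const HEAD: Rule = { name: "HEAD_START", pattern: /<head>/ };
const TAGRULE: Rule = { name: "TAG", pattern: /<[^>]+>/ };
const TEXT: Rule = { name: "ANY", pattern: /[^<&\r\n]+/ };

console.log(firstMatch([HEAD, TAGRULE, TEXT], "<head>x")); // [ 'HEAD_START', 'ANY' ]
console.log(firstMatch([TAGRULE, HEAD, TEXT], "<head>x")); // [ 'TAG', 'ANY' ]
```

With the more general rule listed first, the opening head tag is reported as an ordinary `TAG` and the special rule is never reached.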

### The Definition of the Token `HEAD_END` 

The regular expression `<\/head>` matches the closing head tag.  Note that this token is only
listed in the mode `header` of the lexer definition.  Once the closing tag has been found, processing switches back into the state `INITIAL`, which is the 
<em style="color:blue">start state</em> of the scanner.  As the token definitions do not set `push_mode` or `pop_mode`, this switch is carried out by the function `htmlToText`, which keeps track of the current state.  In the state `INITIAL`, all tokens that are listed under `INITIAL` in the lexer definition are active.

In [16]:
const HEAD_END = createToken({
    name: "HEAD_END",
    pattern: /<\/head>/i 
});

### The Definition of the Token `SCRIPT_END`

The regular expression `<\/script>` matches the closing script tag.  Note that this token is only
listed in the mode `script` of the lexer definition.  Once the closing tag has been found, the function `htmlToText` switches back into the state `INITIAL`, which is the start state of the scanner.  

In [17]:
const SCRIPT_END = createToken({ 
    name: "SCRIPT_END",
    pattern: /<\/script>/i
});

## Error Handling

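The function `t_error` is called when a substring at the beginning of the input can not be matched by any of the regular expressions defined in the various tokens.  Chevrotain collects such failures in the `errors` array of the lexing result.  In our implementation we print the message and the offset of each error and continue.

<b>Note:</b>  The token `ANY` excludes the characters `<` and `&`.  Hence a stray `<` that does not start a tag or a stray `&` that does not start an entity would cause a scanning **error**.  The input used in this notebook contains neither.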

In [18]:
function t_error(char: string, offset: number) {
  console.error(`Illegal character '${char}' at position ${offset}`);
}

The function `t_header_error` is called by `htmlToText` when, in state `header`, a token is found that is neither an allowed header tag nor whitespace.  For the input used in this notebook, this function is never called.

In [19]:
function t_header_error(char: string, offset: number) {
  console.error(`Illegal character in state 'header': '${char}' at position ${offset}`);
}

The function `t_script_error` is called by `htmlToText` when, in state `script`, a token is found that contains one of the characters `<`, `>`, or `&` and is not the closing tag `</script>`.  As the input used in this notebook contains no `<script>` element, this function is never called here.

In [20]:
function t_script_error(char: string, offset: number) {
  console.error(`Illegal character in state 'script': '${char}' at position ${offset}`);
}

## Running the Scanner

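We collect all tokens in the lexer definition, grouped by mode.  Within a mode, the order of the tokens determines the order in which their patterns are tried.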

In [21]:
const lexerDefinition = {
  defaultMode: "INITIAL",
  modes: {
    INITIAL: [
      HEAD_START,
      SCRIPT_START,
      LINEBREAK,
      TAG,
      NAMED_ENTITY,
      UNICODE,
      ANY
    ],
    header: [HEAD_END, { ...ANY, name: "HEADER_ANY" }],
    script: [SCRIPT_END, { ...ANY, name: "SCRIPT_ANY" }]
  }
};

The line below creates the actual Chevrotain `Lexer` from the token definitions.  With the option `{ ensureOptimizations: false }`, the lexer does not raise an error if it cannot apply its internal optimizations to some of the regular expressions.

In [22]:
const HtmlLexer = new Lexer(lexerDefinition, { ensureOptimizations: false });

These functions convert HTML entities into their corresponding characters: `decodeNamedEntity` handles named entities (like `&nbsp;`), while `decodeUnicode` converts numeric ones (like `&#160;`).

In [None]:
function decodeNamedEntity(entity: string): string {
  return decodeHTML(entity);
}

function decodeUnicode(entity: string): string {
  const num = parseInt(entity.replace(/[&#;]/g, ""), 10);
  // fromCodePoint also handles code points above U+FFFF, e.g. 128034.
  return String.fromCodePoint(num);
}

This function tokenizes an HTML string, processes each token based on its context (INITIAL, header, or script), decodes entities, ignores tags, and returns the cleaned plain text content.

In [23]:
function htmlToText(html: string): string {
  let text = "";

  try {
    const lexResult = HtmlLexer.tokenize(html);

    if (lexResult.errors && lexResult.errors.length > 0) {
      lexResult.errors.forEach((err: any) => {
        t_error(err.message, err.offset ?? -1);
      });
    }

    const tokens = lexResult.tokens;
    let mode = "INITIAL";
    let subMode: "title" | "style" | "script" | null = null; // element inside <head> whose content is skipped

    const allowedHeaderTags = ["meta", "title", "link", "style", "base", "script"];

    for (let i = 0; i < tokens.length; i++) {
      const t = tokens[i];
      const value = t.image;

      try {
        // === INITIAL mode ===
        if (mode === "INITIAL") {
          if (/<head>/i.test(value)) {
            mode = "header";
            continue;
          }
          if (/<script\b/i.test(value)) {
            mode = "script";
            continue;
          }

          if (/<br\s*\/?>/i.test(value)) {
            text += "\n";
          } else if (/<\/p>|<p>/i.test(value)) {
            text += "\n\n";
          } else if (/<[^>]+>/.test(value)) {
            // Ignore all other HTML tags
          } else if (/&[A-Za-z]+;?/.test(value)) {
            text += decodeNamedEntity(value);
          } else if (/&#[0-9]+;?/.test(value)) {
            text += decodeUnicode(value);
          } else if (/\n/.test(value)) {
            text += "\n";
          } else if (value.trim() === "") {
            // Ignore whitespace
          } else {
            text += value;
          }

        // === HEADER mode ===
        } else if (mode === "header") {

          // If we are inside a nested element (e.g. <title>):
          if (subMode) {
            const endTag = new RegExp(`</${subMode}>`, "i");
            if (endTag.test(value)) {
              subMode = null; // back to the header level
            } else {
              // Content inside <title>, <style>, or <script> is valid
              continue;
            }

          } else if (/<\/head>/i.test(value)) {
            mode = "INITIAL";

          } else if (/<title>/i.test(value)) {
            subMode = "title";
          } else if (/<style>/i.test(value)) {
            subMode = "style";
          } else if (/<script>/i.test(value)) {
            subMode = "script";
          } else if (
            new RegExp(`<\\/?(?:${allowedHeaderTags.join("|")})\\b[^>]*>`, "i").test(value) ||
            /^\s*$/.test(value)
          ) {
            // Valid header tags or whitespace
            continue;
          } else {
            t_header_error(value, t.startOffset);
          }

        // === SCRIPT mode ===
        } else if (mode === "script") {
          if (/<\/script>/i.test(value)) {
            mode = "INITIAL";
          } else if (/^[^<>&]+$/.test(value) || /^\s*$/.test(value)) {
            continue;
          } else {
            t_script_error(value, t.startOffset);
          }
        }

      } catch (tokenErr) {
        t_error(value, t.startOffset);
      }
    }

    text = text
      .replace(/\u00A0/g, " ")
      .replace(/[ \t]+\n/g, "\n")
      .replace(/\n{3,}/g, "\n\n")
      .trim();

  } catch (err: any) {
    console.error("Fatal error during HTML parsing:", err.message);
  }

  return text;
}



Next, we feed our input string into the generated scanner.

In [None]:
console.log(htmlToText(data));
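
The clean-up chain at the end of `htmlToText` can also be tried in isolation. The sketch below wraps the same four steps in a stand-alone function; the name `cleanUp` and the sample string are hypothetical.

```typescript
// Same clean-up chain as at the end of htmlToText, as a stand-alone function.
function cleanUp(text: string): string {
  return text
    .replace(/\u00A0/g, " ")    // non-breaking spaces become ordinary spaces
    .replace(/[ \t]+\n/g, "\n") // drop blanks and tabs at the end of a line
    .replace(/\n{3,}/g, "\n\n") // allow at most one empty line in a row
    .trim();                    // remove leading and trailing whitespace
}

const raw = "Office:\u00A0\u00A0 Raum 344B  \n\n\n\nPhone: +49 621 4105-1376\n";
console.log(JSON.stringify(cleanUp(raw)));
// "Office:   Raum 344B\n\nPhone: +49 621 4105-1376"
```

The non-breaking spaces produced by `&nbsp;` survive as ordinary blanks inside a line, while trailing blanks and runs of empty lines are removed.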

To inspect the individual tokens, we tokenize `data` directly and iterate over all tokens generated by our scanner.

In [None]:
const lexResult = HtmlLexer.tokenize(data);

// Print all tokens:
for (const tok of lexResult.tokens) {
  console.log({
    name: tok.tokenType.name,
    image: tok.image,
    startLine: tok.startLine,
    startColumn: tok.startColumn
  });
}


This loop prints all lexer modes and their corresponding token names with regex patterns to the console.

In [None]:
for (const [mode, toks] of Object.entries(lexerDefinition.modes)) {
  console.log(`\nTokens in mode: ${mode}`);
  for (const t of toks as any[]) {
    console.log(`  ${t.name} → ${t.PATTERN}`);
  }
}