In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css: string = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Converting HTML to Text

This notebook demonstrates how to build a simple but effective HTML-to-text converter using **TypeScript** and the parsing library [`lezer`](https://lezer.codemirror.net/).

## The Goal

Our objective is to extract the readable, plain text content from a given HTML document. We will use the HTML source code from the homepage of [Prof. Dr. Karl Stroetmann](http://wwwlehre.dhbw-stuttgart.de/~stroetma/) as our example data. 

To achieve this, we define a `lezer` grammar, embedded in TypeScript as a string, that distinguishes between structural tags such as `<head>` and `<script>`, generic tags, entities, and normal text.

First, let's load our example HTML data:

In [None]:
const data: string = `
<html>
  <head>
    <meta charset="utf-8">
    <title>Homepage of Prof. Dr. Karl Stroetmann</title>
    <link type="text/css" rel="stylesheet" href="style.css" />
    <link href="http://fonts.googleapis.com/css?family=Rochester&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Pacifico&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Cabin+Sketch&subset=latin,latin-ext" rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Sacramento" rel="stylesheet" type="text/css">
  </head>
  <body>
    <hr/>

    <div id="table">
      <header>
        <h1 id="name">Prof. Dr. Karl Stroetmann</h1>
      </header>

      <div id="row1">
        <div class="right">
          <a id="dhbw" href="http://www.ba-stuttgart.de">Duale Hochschule Baden-W&uuml;rttemberg</a>
          <br/>Coblitzallee 1-9
          <br/>68163 Mannheim
          <br/>Germany
	  <br>
          <br/>Office: &nbsp;&nbsp;&nbsp; Raum 344B
          <br/>Phone:&nbsp;&nbsp;&nbsp; +49 621 4105-1376
          <br/>Fax:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; +49 621 4105-1194
          <br/>Skype: &nbsp;&nbsp;&nbsp; karlstroetmann
        </div>  


        <div id="links">
          <strong class="some">Some links:</strong>
          <ul class="inlink">
            <li class="inlink">
	      My <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">lecture notes</a>,
              as well as the programs presented in class, can be found
              at <br>
              <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">https://github.com/karlstroetmann</a>.
              
            </li>
            <li class="inlink">Most of my papers can be found at <a class="inlink" href="https://www.researchgate.net/">researchgate.net</a>.</li>
            <li class="inlink">The programming language SetlX can be downloaded at <br>
              <a href="http://randoom.org/Software/SetlX"><tt class="inlink">http://randoom.org/Software/SetlX</tt></a>.
            </li>
          </ul>
        </div>
      </div>
    </div>
    
    <div id="intro">
      As I am getting old and wise, I have to accept the limits of
      my own capabilities.  I have condensed these deep philosophical
      insights into a most beautiful pearl of poetry.  I would like 
      to share these humble words of wisdom:
      
      <div class="poetry">
        I am a teacher by profession,    <br>
        mostly really by obsession;      <br>
        But even though I boldly try,    <br>
        I just cannot teach <a href="flying-pig.jpg" id="fp">pigs</a> to fly.</br>
        Instead, I slaughter them and fry.
      </div>
      
      <div class="citation">
        <div class="quote">
          Any sufficiently advanced poetry is indistinguishable from divine wisdom.
        </div>
        <div id="sign">His holiness Pope Hugo &#8555;.</div>
      </div>
    </div>
</div>

</body>
</html>
`;

In [None]:
display.html(data)

The original web page is still available at https://wwwlehre.dhbw-stuttgart.de/~stroetma/.

## Imports

Before we can build our HTML scanner, the necessary packages have to be installed and imported.
We use `lezer` to generate the scanner, `entities` to decode named HTML entities, and `RecursiveSet` to hold the set of token types that are printed, as specified in the requirements.

In [None]:
import { buildParser } from '@lezer/generator';
import { Tree, TreeCursor } from '@lezer/common';
import { LRParser } from '@lezer/lr';
import { decodeHTML } from 'entities';
import { RecursiveSet } from "recursive-set";

## Token Declarations
We begin by declaring the tokens. In `lezer`, the structure of the document is given by the `@top` rule, and all possible token types are listed as alternatives of the `token` rule.

* `HeadStart` will match the tag `<head>` that starts the definition of the HTML header.
* `HeadEnd` will match the tag `</head>` that ends the definition of the HTML header.
* `ScriptStart` will match the tag `<script>` (including attributes) that starts embedded *JavaScript* code.
* `ScriptEnd` will match the tag `</script>` that ends embedded *JavaScript* code.
* `Tag` is a token that represents arbitrary HTML tags.
* `LineBreak` is a token that will match newline characters.
* `NamedEntity` is a token that represents named HTML entities.
* `UnicodeEntity` is a token that represents a numeric Unicode entity such as `&#8555;`.
* `Text` is a token that matches any other character content.

**Important:** We use a `@precedence` block to resolve conflicts between overlapping tokens (e.g. `<` in `<head>` vs. `<` as text).

### 1. Document Structure

First, we define the entry point (`@top`).

In [None]:
const entryPoint: string = `
  @top Document { token* }
`;

const tokenStructure: string = `
  token {
    HeadStart | HeadEnd |
    ScriptStart | ScriptEnd |
    Tag |
    LineBreak |
    NamedEntity |
    UnicodeEntity |
    Text
  }
`;

## Token Definitions

We proceed to give the definition of the tokens inside the `@tokens` block.

### The Definition of Structure Tokens

We define specific tokens for structural elements like `<head>` and `<script>`. These are crucial because they trigger state changes in our scanner (e.g., ignoring content inside the header).

In [None]:
const structureTokens: string = `
    // 1. Specific Structure Tags
    HeadStart { "<head>" }
    HeadEnd { "</head>" }
    
    // Script Start: <script followed by any char except >, ends with >
    ScriptStart { "<script" ![>]* ">" }
    ScriptEnd { "</script>" }
`;

### The Definition of General Tags

The `Tag` token matches any generic HTML tag that wasn't caught by the specific rules above. It starts with `<` and ends with `>`.

In [None]:
const generalTags: string = `
    // 2. General Tags
    Tag { "<" ![>]+ ">" }
`;

### The Definition of Line Breaks

We define `LineBreak` to capture newlines. We also include surrounding whitespace/tabs to clean up the output formatting.

In [None]:
const lineBreaks: string = `
    // 3. Line Breaks
    // Matches a sequence containing at least one newline.
    LineBreak { $[ \\t]* "\\n" $[ \\t\\n]* }
`;
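
To see what this rule consumes, the sketch below applies a JavaScript regular expression of the same shape as `LineBreak` to a small string. It is only an approximation: the regex runs in isolation, whereas in the grammar a preceding `Text` token may already have claimed blanks in front of the newline. The names `lineBreakPattern`, `collapseLineBreaks`, and `indentedSample` are made up for this illustration and are not part of the notebook.

```typescript
// Regex counterpart of the rule: LineBreak { $[ \t]* "\n" $[ \t\n]* }
// (illustrative only; the notebook uses the lezer token, not this regex)
const lineBreakPattern: RegExp = /[ \t]*\n[ \t\n]*/g;

// Replace every matched run by a single newline, similar to the converter,
// which emits "\n" for each LineBreak token.
function collapseLineBreaks(input: string): string {
    return input.replace(lineBreakPattern, "\n");
}

// A newline, a whitespace-only line, and the indentation of the next line
// are all swallowed by one match.
const indentedSample: string = "Germany\n\t  \n          Office";
const collapsedSample: string = collapseLineBreaks(indentedSample);
console.log(JSON.stringify(collapsedSample));
```

Because the trailing `$[ \t\n]*` also absorbs the indentation of the following line, the converted text does not inherit the indentation of the HTML source.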

### The Definition of Entities

We define tokens for both named entities (like `&amp;`) and numeric unicode entities (like `&#1234;`).

In [None]:
const entities: string = `
    // 4. Entities
    NamedEntity { "&" $[a-zA-Z]+ ";"? }
    UnicodeEntity { "&#" $[0-9]+ ";"? }
`;

### The Definition of Text

The `Text` token is our "catch-all" for content. It matches any sequence of characters that isn't a tag start (`<`), entity start (`&`), or newline.
Crucially, we also allow single `<` or `&` characters here as a fallback if they didn't match a valid tag or entity rule (e.g., in "1 < 2").

In [None]:
const textContent: string = `
    // 5. Text
    // Text is anything that is not a tag start, entity start or newline.
    // Individual < or & are accepted as fallback text.
    Text { 
      ![<&\\n]+ | 
      $[<&] 
    }
`;

### Conflict Resolution (Precedence)

In `lezer`, when multiple tokens could match the same input (e.g., `<head>` matches both `HeadStart` and `Tag`), we need to explicitly define precedence.
We assign the highest priority to the specific structure tags, followed by general tags, line breaks, entities, and finally text.

In [None]:
const precedence: string = `
    // Conflict Resolution: Specific tags win over general tags and text
    @precedence { 
      HeadStart, 
      HeadEnd, 
      ScriptStart, 
      ScriptEnd, 
      Tag, 
      LineBreak,
      NamedEntity,
      UnicodeEntity,
      Text 
    }
`;
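
The effect of this priority order can be tried out without `lezer`. The following sketch is a stand-in, not lezer's actual tokenizer: it tries one anchored regular expression per token, top to bottom, and takes the first that matches. Whether lezer resolves every overlap in exactly this way is not verified here. The names `SketchRule`, `SketchToken`, `orderedRules`, `sketchTokenize`, and `tokenNames` are hypothetical.

```typescript
// Illustrative stand-in, NOT lezer's algorithm: rules are tried top to
// bottom and the first one that matches at the current position wins.
interface SketchRule {
    name: string;
    pattern: RegExp; // anchored with ^ so it only matches at the current position
}

interface SketchToken {
    name: string;
    value: string;
}

// Same order as the @precedence block.
const orderedRules: SketchRule[] = [
    { name: "HeadStart", pattern: /^<head>/ },
    { name: "HeadEnd", pattern: /^<\/head>/ },
    { name: "ScriptStart", pattern: /^<script[^>]*>/ },
    { name: "ScriptEnd", pattern: /^<\/script>/ },
    { name: "Tag", pattern: /^<[^>]+>/ },
    { name: "LineBreak", pattern: /^[ \t]*\n[ \t\n]*/ },
    { name: "NamedEntity", pattern: /^&[a-zA-Z]+;?/ },
    { name: "UnicodeEntity", pattern: /^&#[0-9]+;?/ },
    { name: "Text", pattern: /^(?:[^<&\n]+|[<&])/ },
];

function sketchTokenize(input: string): SketchToken[] {
    const tokens: SketchToken[] = [];
    let rest: string = input;
    while (rest.length > 0) {
        // Safety net: if no rule matches, consume one character as Text.
        let name: string = "Text";
        let value: string = rest.charAt(0);
        for (const rule of orderedRules) {
            const match = rule.pattern.exec(rest);
            if (match !== null) {
                name = rule.name;
                value = match[0];
                break;
            }
        }
        tokens.push({ name, value });
        rest = rest.slice(value.length);
    }
    return tokens;
}

const tokenNames = (s: string): string[] => sketchTokenize(s).map((t) => t.name);

console.log(tokenNames("<head>"));   // HeadStart wins over Tag
console.log(tokenNames("<header>")); // only Tag matches
console.log(tokenNames("1 < 2"));    // the lone "<" falls back to Text
```

In this sketch, `<head>` is reported as `HeadStart` although the `Tag` pattern matches it as well, and the `<` in `1 < 2` becomes a one-character `Text` token because no tag pattern can complete.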

## Assembling the Grammar
Finally, we combine all these parts into the complete grammar string that `lezer` requires.

In [None]:
const htmlGrammar: string =
    entryPoint +
    tokenStructure +
    " @tokens { " +
    structureTokens +
    generalTags +
    lineBreaks +
    entities +
    textContent +
    precedence +
    " } " +
    " @skip {} ";

In [None]:
htmlGrammar

## Helper Functions

To convert the entities back to readable text, we use `decodeHTML` from the `entities` package for named entities and the following helper function for numeric entities:

In [None]:
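// Converts a decimal code point given as a string (e.g. "8555") into the
// corresponding character. Returns "" if the input is not a valid code point.
function decodeUnicode(unicode: string): string {
    const code = parseInt(unicode, 10);
    // String.fromCodePoint throws a RangeError outside 0..0x10FFFF.
    if (isNaN(code) || code < 0 || code > 0x10ffff) return "";
    return String.fromCodePoint(code);
}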

In [None]:
decodeHTML("&auml;");

In [None]:
decodeUnicode("8555");

In [None]:
decodeUnicode("128034");
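
The two decoding paths can also be illustrated without any package. The sketch below is an assumption-laden stand-in: `namedEntitySubset` is a tiny hand-written table, whereas the `entities` package ships a far larger one, and `decodeEntitySketch` is a hypothetical name. It accepts the same two shapes as the grammar, including the optional semicolon.

```typescript
// Hand-written subset of the named-entity table (hypothetical, for
// illustration only; the real table in the `entities` package is much larger).
const namedEntitySubset: Record<string, string> = {
    amp: "&",
    lt: "<",
    gt: ">",
    nbsp: "\u00A0",
    auml: "\u00E4",
    uuml: "\u00FC",
};

function decodeEntitySketch(raw: string): string {
    // Numeric form: "&#", decimal digits, optional ";"
    const numeric = /^&#([0-9]+);?$/.exec(raw);
    if (numeric !== null) {
        const code = parseInt(numeric[1], 10);
        // Values above U+10FFFF are not code points; leave the text as it is.
        return code <= 0x10ffff ? String.fromCodePoint(code) : raw;
    }
    // Named form: "&", letters, optional ";"
    const named = /^&([a-zA-Z]+);?$/.exec(raw);
    if (named !== null && Object.prototype.hasOwnProperty.call(namedEntitySubset, named[1])) {
        return namedEntitySubset[named[1]];
    }
    // Unknown names are returned unchanged in this sketch.
    return raw;
}

console.log(decodeEntitySketch("&uuml;"));  // named entity from the subset
console.log(decodeEntitySketch("&#8555;")); // numeric entity, U+216B
console.log(decodeEntitySketch("&foo;"));   // not in the subset, unchanged
```

Returning the raw text for anything that cannot be decoded is a design choice of this sketch; it keeps unknown input visible instead of silently dropping it.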

## The Scanner Class

We encapsulate the logic in a class `HtmlToTextConverter`. This class manages:

1. The **Parser** (created from the grammar).
2. The **State** (`state`): We distinguish between `INITIAL` (normal text), `HEADER` (inside `<head>`), and `SCRIPT` (inside `<script>`).
3. The **Output**: A list of strings that is joined at the end.

We use `RecursiveSet` to efficiently define which token types should actually be printed in the normal text flow (Text, LineBreaks, and Entities).
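
The interplay of state and printable set can be sketched without a parser. In the following minimal example, a built-in `Set` stands in for `RecursiveSet`, the token list is written by hand instead of being produced by `lezer`, and entity decoding is left out. All names (`SketchState`, `MiniToken`, `printableTypes`, `filterTokens`, `handWrittenTokens`) are hypothetical.

```typescript
// Dependency-free sketch of the state logic only.
type SketchState = "INITIAL" | "HEADER" | "SCRIPT";

interface MiniToken {
    type: string;
    value: string;
}

// Token types that may reach the output (a built-in Set as a stand-in).
const printableTypes: Set<string> = new Set<string>([
    "Text", "LineBreak", "NamedEntity", "UnicodeEntity",
]);

function filterTokens(tokens: MiniToken[]): string {
    let state: SketchState = "INITIAL";
    const out: string[] = [];
    for (const token of tokens) {
        // 1. Structure tokens switch the state.
        if (token.type === "HeadStart") state = "HEADER";
        else if (token.type === "ScriptStart") state = "SCRIPT";
        else if (token.type === "HeadEnd" || token.type === "ScriptEnd") state = "INITIAL";

        // 2. Only printable tokens seen in the INITIAL state are emitted.
        if (state === "INITIAL" && printableTypes.has(token.type)) {
            out.push(token.type === "LineBreak" ? "\n" : token.value);
        }
    }
    return out.join("");
}

const handWrittenTokens: MiniToken[] = [
    { type: "HeadStart", value: "<head>" },
    { type: "Tag", value: "<title>" },
    { type: "Text", value: "Homepage" },          // dropped: inside <head>
    { type: "Tag", value: "</title>" },
    { type: "HeadEnd", value: "</head>" },
    { type: "Tag", value: "<h1>" },               // dropped: not printable
    { type: "Text", value: "Prof. Dr. Karl Stroetmann" },
    { type: "Tag", value: "</h1>" },
    { type: "LineBreak", value: "\n    " },
    { type: "ScriptStart", value: "<script>" },
    { type: "Text", value: "alert(1)" },          // dropped: inside <script>
    { type: "ScriptEnd", value: "</script>" },
];

console.log(JSON.stringify(filterTokens(handWrittenTokens)));
```

Only the heading text and the line break survive: the title is suppressed by the `HEADER` state, the script body by the `SCRIPT` state, and the tags because their type is not in the printable set.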


In [None]:
interface HtmlToken {
    type: string;
    value: string;
    start: number;
    end: number;
    line: number;
}

enum ScannerState {
    INITIAL = "INITIAL",
    HEADER = "HEADER",
    SCRIPT = "SCRIPT",
}

class HtmlToTextConverter {
    private parser: LRParser;
    private state: ScannerState;
    private output: string[];

    // RecursiveSet holds the token types that are printed, as required
    private readonly printableTokens: RecursiveSet<string>;

    constructor() {
        try {
            this.parser = buildParser(htmlGrammar);
        } catch (e: unknown) {
            // 'unknown' instead of 'any' for type safety
            console.error("Grammar Generation Error:", e);
            throw e;
        }
        this.state = ScannerState.INITIAL;
        this.output = [];

        // Create an empty set and add the elements with .add()
        this.printableTokens = new RecursiveSet<string>();
        this.printableTokens.add("Text");
        this.printableTokens.add("LineBreak");
        this.printableTokens.add("NamedEntity");
        this.printableTokens.add("UnicodeEntity");
    }

    private getLineNumber(source: string, offset: number): number {
        let lines = 1;
        for (let i = 0; i < offset; i++) {
            if (source[i] === "\n") lines++;
        }
        return lines;
    }

    private extractTokens(tree: Tree, source: string): HtmlToken[] {
        const cursor: TreeCursor = tree.cursor();
        const tokens: HtmlToken[] = [];
        do {
            if (cursor.name === "Document" || cursor.name === "token") continue;

            tokens.push({
                type: cursor.name,
                value: source.substring(cursor.from, cursor.to),
                start: cursor.from,
                end: cursor.to,
                line: this.getLineNumber(source, cursor.from),
            });
        } while (cursor.next());
        return tokens;
    }

    private processTokenState(type: string): void {
        switch (type) {
            case "HeadStart":
                this.state = ScannerState.HEADER;
                break;
            case "HeadEnd":
                this.state = ScannerState.INITIAL;
                break;
            case "ScriptStart":
                this.state = ScannerState.SCRIPT;
                break;
            case "ScriptEnd":
                this.state = ScannerState.INITIAL;
                break;
        }
    }

    private processTokenOutput(token: HtmlToken): void {
        if (this.state !== ScannerState.INITIAL) return;

        // Check whether the token type is in the set of printable tokens
        if (!this.printableTokens.has(token.type)) return;

        switch (token.type) {
            case "LineBreak":
                this.output.push("\n");
                break;

            case "NamedEntity":
                // Case 1: &auml; -> decodeHTML from the 'entities' package
                this.output.push(decodeHTML(token.value));
                break;

            case "UnicodeEntity": {
                // Case 2: &#8555; -> strip the syntax and call decodeUnicode.
                // The regex removes everything that is not a digit ("&#" and ";").
                const numericPart = token.value.replace(/[^\d]/g, "");
                this.output.push(decodeUnicode(numericPart));
                break;
            }

            case "Text":
                this.output.push(token.value);
                break;
        }
    }

    public convert(htmlContent: string): string {
        this.state = ScannerState.INITIAL;
        this.output = [];

        let tree: Tree;
        try {
            tree = this.parser.parse(htmlContent);
        } catch (e: unknown) {
            // Error handling without 'any'
            console.error("Parse Error:", e);
            return "";
        }

        const tokens = this.extractTokens(tree, htmlContent);

        for (const token of tokens) {
            this.processTokenState(token.type);
            this.processTokenOutput(token);
        }

        return this.output.join("");
    }

    public convertAndPrint(htmlContent: string): void {
        const result = this.convert(htmlContent);
        console.log(result);
        // Optional: display.text(result) when running in a tslab environment
    }
}

## Running the Scanner

Now we instantiate the converter and process our HTML data.

In [None]:
const converter = new HtmlToTextConverter();

In [None]:
converter.convertAndPrint(data);