In [None]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css : string = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Converting <span style="font-variant:small-caps;">Html</span> to Text

This notebook shows how we can use **TypeScript** and **Regular Expressions** to extract the text that is embedded in an <span style="font-variant:small-caps;">Html</span> file.
To keep the notebook concise, the scanner supports only a small subset of 
<span style="font-variant:small-caps;">Html</span>. Below is the content of my old
<a href="http://wwwlehre.dhbw-stuttgart.de/~stroetma/">web page</a>, which I used while I was still working at the DHBW Stuttgart. The goal of this notebook is to write 
a scanner that is able to extract the text from this web page.

In [None]:
const data: string = `
<html>
  <head>
    <meta charset="utf-8">
    <title>Homepage of Prof. Dr. Karl Stroetmann</title>
    <link type="text/css" rel="stylesheet" href="style.css" />
    <link href="http://fonts.googleapis.com/css?family=Rochester&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Pacifico&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Cabin+Sketch&subset=latin,latin-ext" rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Sacramento" rel="stylesheet" type="text/css">
  </head>
  <body>
    <hr/>

    <div id="table">
      <header>
        <h1 id="name">Prof. Dr. Karl Stroetmann</h1>
      </header>

      <div id="row1">
        <div class="right">
          <a id="dhbw" href="http://www.ba-stuttgart.de">Duale Hochschule Baden-W&uuml;rttemberg</a>
          <br/>Coblitzallee 1-9
          <br/>68163 Mannheim
          <br/>Germany
	  <br>
          <br/>Office: &nbsp;&nbsp;&nbsp; Raum 344B
          <br/>Phone:&nbsp;&nbsp;&nbsp; +49 621 4105-1376
          <br/>Fax:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; +49 621 4105-1194
          <br/>Skype: &nbsp;&nbsp;&nbsp; karlstroetmann
        </div>  


        <div id="links">
          <strong class="some">Some links:</strong>
          <ul class="inlink">
            <li class="inlink">
	      My <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">lecture notes</a>,
              as well as the programs presented in class, can be found
              at <br>
              <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">https://github.com/karlstroetmann</a>.
              
            </li>
            <li class="inlink">Most of my papers can be found at <a class="inlink" href="https://www.researchgate.net/">researchgate.net</a>.</li>
            <li class="inlink">The programming language SetlX can be downloaded at <br>
              <a href="http://randoom.org/Software/SetlX"><tt class="inlink">http://randoom.org/Software/SetlX</tt></a>.
            </li>
          </ul>
        </div>
      </div>
    </div>
    
    <div id="intro">
      As I am getting old and wise, I have to accept the limits of
      my own capabilities.  I have condensed these deep philosophical
      insights into a most beautiful pearl of poetry.  I would like 
      to share these humble words of wisdom:
      
      <div class="poetry">
        I am a teacher by profession,    <br>
        mostly really by obsession;      <br>
        But even though I boldly try,    <br>
        I just cannot teach <a href="flying-pig.jpg" id="fp">pigs</a> to fly.</br>
        Instead, I slaughter them and fry.
      </div>
      
      <div class="citation">
        <div class="quote">
          Any sufficiently advanced poetry is indistinguishable from divine wisdom.
        </div>
        <div id="sign">His holiness Pope Hugo &#8555;.</div>
      </div>
    </div>
</div>

</body>
</html>
`;

In [None]:
display.html(data)

The original web page is still available at https://wwwlehre.dhbw-stuttgart.de/~stroetma/.

## Regular Expression Definitions

We define the token patterns using standard TypeScript `RegExp` objects. We need to identify tags, line breaks, and entities.

A key detail in these definitions is the use of the **sticky flag (`y`)**. This flag forces the match to happen **exactly** at the position specified by the regex's `lastIndex` property (which corresponds to our scanner's cursor). This ensures that we consume the input string token by token from the current position, without skipping ahead or searching elsewhere in the string. 

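Because the `y` flag already enforces this strict positioning relative to our cursor, we do not need the start anchor `^`. In fact, `^` would be wrong here: without the `m` flag it anchors to the beginning of the input string, so a pattern starting with `^` could only match while the cursor is at position 0.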
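
The difference between the sticky flag and the more familiar global flag (`g`) can be seen in a small experiment. The following is a minimal sketch; the names `sampleText`, `stickyTag`, and `globalTag` are illustrative and are not used by the scanner.

```typescript
// Illustrative only: these names are not part of the scanner.
const sampleText = "ab<i>cd";
const stickyTag = /<[^>]+>/y;  // has to match exactly at lastIndex
const globalTag = /<[^>]+>/g;  // may search forward from lastIndex

stickyTag.lastIndex = 0;
const stickyAtZero = stickyTag.exec(sampleText);   // null: position 0 holds "a"

globalTag.lastIndex = 0;
const globalFromZero = globalTag.exec(sampleText); // skips ahead and finds "<i>" at index 2

stickyTag.lastIndex = 2;
const stickyAtTwo = stickyTag.exec(sampleText);    // matches "<i>" at index 2

console.log(stickyAtZero, globalFromZero?.index, stickyAtTwo?.[0]);
```

With the global flag the engine silently skips the characters `ab`, which a scanner must never do, since skipped characters would be lost from the output.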

In [None]:
const REGEX_HEAD_START   = /<head>/y;
const REGEX_HEAD_END     = /<\/head>/y;
const REGEX_SCRIPT_START = /<script>/y;
const REGEX_SCRIPT_END   = /<\/script>/y;
const REGEX_TAG          = /<[^>]+>/y;

// Matches a run of whitespace that contains at least one newline
const REGEX_LINEBREAK    = /(\s*\n\s*)+/y;
const REGEX_NAMED_ENTITY = /&[a-zA-Z]+;?/y;
const REGEX_UNICODE      = /&#[0-9]+;?/y;

## Entity Decoding

To handle HTML entities like `&uuml;` or `&#8555;`, we need a helper function. In a full browser environment, we could use the DOM, but for this standalone script, we define a map for common entities and a parser for numeric ones.

In [None]:
const html5Entities: Record<string, string> = {
    "auml": "ä",
    "ouml": "ö",
    "uuml": "ü",
    "Auml": "Ä",
    "Ouml": "Ö",
    "Uuml": "Ü",
    "nbsp": " "
};

In [None]:
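function decodeEntity(entity: string): string {
    if (entity.startsWith("&#")) {
        // Numeric entity such as &#8555;: keep only the decimal digits.
        const code = parseInt(entity.replace(/[^0-9]/g, ""), 10);
        // String.fromCodePoint also handles code points above U+FFFF,
        // which String.fromCharCode would truncate to 16 bits.
        // Values outside the Unicode range are passed through unchanged.
        return code <= 0x10FFFF ? String.fromCodePoint(code) : entity;
    } else {
        // Named entity such as &uuml; (the trailing semicolon is optional).
        let name = entity.substring(1);
        if (name.endsWith(";")) name = name.slice(0, -1);
        // Unknown names are passed through unchanged.
        return html5Entities[name] || entity;
    }
}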
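
To illustrate the idea of combining a lookup table with a numeric parser, here is a minimal, self-contained sketch that decodes all entities of a string in a single call to `String.prototype.replace`. The names `sampleEntities` and `decodeAllEntities` are hypothetical, and the table contains only two entries for demonstration purposes.

```typescript
// Hypothetical names; a tiny table stands in for a full entity map.
const sampleEntities = new Map<string, string>([
    ["uuml", "ü"],
    ["amp", "&"],
]);

function decodeAllEntities(text: string): string {
    // Group 1 captures the digits of a numeric reference,
    // group 2 captures the name of a named entity.
    return text.replace(/&#([0-9]+);?|&([a-zA-Z]+);?/g,
        (whole: string, digits?: string, name?: string): string => {
            if (digits !== undefined) {
                const code = parseInt(digits, 10);
                // Values outside the Unicode range are left untouched.
                return code <= 0x10FFFF ? String.fromCodePoint(code) : whole;
            }
            // Unknown names are left untouched as well.
            return name !== undefined ? (sampleEntities.get(name) ?? whole) : whole;
        });
}

console.log(decodeAllEntities("Baden-W&uuml;rttemberg, Hugo &#8555;, &unknown;"));
```

Using a `Map` instead of a plain object is a design choice of this sketch: a lookup such as `constructor` then cannot accidentally hit a property inherited from `Object.prototype`.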

## Definition of the States

We need to track the state of our scanner because the rules change when we are inside a `<head>` or `<script>` element. 

We define three states:
- `INITIAL`: The default state where we process text and look for tags.
- `HEADER`: We are inside the `<head>...</head>` block. We discard everything until the closing tag.
- `SCRIPT`: We are inside a `<script>...</script>` block. We discard everything until the closing tag.

In [None]:
type ScannerState = 'INITIAL' | 'HEADER' | 'SCRIPT';
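
The transitions between these states can be summarized as a small transition function. The following is only a sketch: `SketchState` and `nextState` are hypothetical names, and the function looks at complete tags, whereas the scanner itself works on cursor positions.

```typescript
// Hypothetical helper that only illustrates the state transitions.
type SketchState = 'INITIAL' | 'HEADER' | 'SCRIPT';

function nextState(state: SketchState, tag: string): SketchState {
    switch (state) {
        case 'INITIAL':
            if (tag === "<head>")   return 'HEADER';
            if (tag === "<script>") return 'SCRIPT';
            return 'INITIAL';
        case 'HEADER':
            // Only the closing tag leaves the HEADER state.
            return tag === "</head>" ? 'INITIAL' : 'HEADER';
        case 'SCRIPT':
            // Only the closing tag leaves the SCRIPT state.
            return tag === "</script>" ? 'INITIAL' : 'SCRIPT';
    }
}

let current: SketchState = 'INITIAL';
const trace: SketchState[] = [];
for (const tag of ["<head>", "<title>", "</head>", "<p>"]) {
    current = nextState(current, tag);
    trace.push(current);
}
console.log(trace.join(" -> ")); // HEADER -> HEADER -> INITIAL -> INITIAL
```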

## Scanner Implementation

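The `scanHtml` function is the main driver of our tokenizer. Instead of modifying the input string, it maintains a numeric `cursor` pointing to the current position. Because the remaining input is never copied, the scanner makes a single left-to-right pass; for typical input, where every tag is closed after a few characters, the running time is roughly linear in the length of the input ($O(N)$).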

### Helper: Sticky Matching

Since we are using the sticky flag (`y`), the regex engine requires us to manually set the start position for every match attempt via the `lastIndex` property. The `matchAt` helper encapsulates this logic to keep the main loop clean.

**Specs:**
* **Input:** The regex pattern, the full input string, and the current cursor position.
* **Action:** Sets `regex.lastIndex` to `cursor` and executes the check.
* **Output:** The `RegExpExecArray` if a match is found at that exact position, otherwise `null`.

In [None]:
function matchAt(regex: RegExp, input: string, cursor: number): RegExpExecArray | null {
    regex.lastIndex = cursor;
    return regex.exec(input);
}
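
The behaviour of this helper can be checked on a tiny input. The sketch below uses a stand-alone copy of the helper under the hypothetical name `matchAtSketch`, so that it runs on its own.

```typescript
// Stand-alone copy of the helper under a hypothetical name.
function matchAtSketch(regex: RegExp, input: string, cursor: number): RegExpExecArray | null {
    regex.lastIndex = cursor;
    return regex.exec(input);
}

const tagPattern = /<[^>]+>/y;
const snippet = "a<b>c";

const missAtZero = matchAtSketch(tagPattern, snippet, 0);  // null: "a" does not start a tag
const indexAfterMiss = tagPattern.lastIndex;                // a failed sticky match resets lastIndex to 0
const hitAtOne = matchAtSketch(tagPattern, snippet, 1);     // matches "<b>"
const indexAfterHit = tagPattern.lastIndex;                 // 4, the position just past "<b>"

console.log(missAtZero, indexAfterMiss, hitAtOne?.[0], indexAfterHit);
```

After a successful match, `lastIndex` points just past the matched text. A scanner could therefore also advance its cursor by the length of the match instead of relying on hard-coded tag lengths.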

### The Main Loop

The scanner iterates through the input until the cursor reaches the end. The logic is structured as a **Finite Automaton** using a `switch` statement on the current `state`.

**Logic Flow:**
1.  **State Dispatch:** The scanner checks the current `state` (`INITIAL`, `HEADER`, or `SCRIPT`).
2.  **INITIAL State:**
    * **Transitions:** Checks if we are entering a `<head>` or `<script>` block.
    * **Tokenizing:** Checks for tags (ignored), linebreaks (normalized), or entities (decoded).
    * **Default:** If no pattern matches, the current character is accepted as plain text content and added to the buffer.
3.  **HEADER / SCRIPT States:**
    * These states act as "consumers". They ignore all content (incrementing the cursor) until the specific closing tag is matched, which triggers a transition back to `INITIAL`.

In [None]:
function scanHtml(input: string) {
    let cursor = 0;
    let state: ScannerState = 'INITIAL';
    let outputBuffer = "";
    const len = input.length;

    while (cursor < len) {
        switch (state) {
            case 'INITIAL':
                // 1. Check for State changes
                if (matchAt(REGEX_HEAD_START, input, cursor)) {
                    state = 'HEADER';
                    cursor += 6; // Length of <head>
                    continue;
                }
                if (matchAt(REGEX_SCRIPT_START, input, cursor)) {
                    state = 'SCRIPT';
                    cursor += 8; // Length of <script>
                    continue;
                }

                // 2. Check for Tags
                const tagMatch = matchAt(REGEX_TAG, input, cursor);
                if (tagMatch) {
                    cursor += tagMatch[0].length;
                    continue;
                }

                // 3. Check for Linebreaks
                const lbMatch = matchAt(REGEX_LINEBREAK, input, cursor);
                if (lbMatch) {
                    outputBuffer += "\n";
                    cursor += lbMatch[0].length;
                    continue;
                }

                // 4. Check for Entities
                const uniMatch = matchAt(REGEX_UNICODE, input, cursor);
                if (uniMatch) {
                    outputBuffer += decodeEntity(uniMatch[0]);
                    cursor += uniMatch[0].length;
                    continue;
                }
                const entMatch = matchAt(REGEX_NAMED_ENTITY, input, cursor);
                if (entMatch) {
                    outputBuffer += decodeEntity(entMatch[0]);
                    cursor += entMatch[0].length;
                    continue;
                }

                // 5. Default: Print character
                outputBuffer += input[cursor];
                cursor++;
                break;

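            case 'HEADER':
                // Discard everything up to and including </head>.
                if (matchAt(REGEX_HEAD_END, input, cursor)) {
                    state = 'INITIAL';
                    cursor += 7; // Length of </head>
                } else cursor++;
                break;

            case 'SCRIPT':
                // Discard everything up to and including </script>.
                if (matchAt(REGEX_SCRIPT_END, input, cursor)) {
                    state = 'INITIAL';
                    cursor += 9; // Length of </script>
                } else cursor++;
                break;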
        }
    }
    console.log(outputBuffer);
}
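
For experiments it can be convenient to have a scanner that returns the extracted text instead of printing it. The following condensed variant is a sketch under that assumption and not a replacement for `scanHtml`; the names `MiniState`, `miniEntities`, and `extractText` are hypothetical, and the entity table has just two entries.

```typescript
// Hypothetical, condensed variant of the scanner that returns its result.
type MiniState = 'INITIAL' | 'HEADER' | 'SCRIPT';

const miniEntities = new Map<string, string>([["uuml", "ü"], ["nbsp", " "]]);

function extractText(input: string): string {
    // Sticky match at position pos; returns the matched text or null.
    const at = (re: RegExp, pos: number): string | null => {
        re.lastIndex = pos;
        const m = re.exec(input);
        return m ? m[0] : null;
    };
    // Closing tag that ends each of the two skipping states.
    const closing: Record<'HEADER' | 'SCRIPT', RegExp> = {
        HEADER: /<\/head>/y,
        SCRIPT: /<\/script>/y,
    };
    const headStart = /<head>/y, scriptStart = /<script>/y, tag = /<[^>]+>/y;
    const lineBreak = /\s*\n\s*/y, numeric = /&#[0-9]+;?/y, named = /&[a-zA-Z]+;?/y;

    let state: MiniState = 'INITIAL';
    let cursor = 0;
    let out = "";
    while (cursor < input.length) {
        let m: string | null;
        if (state !== 'INITIAL') {
            // Skipping state: discard input until the closing tag appears.
            m = at(closing[state], cursor);
            if (m !== null) { state = 'INITIAL'; cursor += m.length; }
            else cursor++;
        } else if ((m = at(headStart, cursor)) !== null) {
            state = 'HEADER'; cursor += m.length;
        } else if ((m = at(scriptStart, cursor)) !== null) {
            state = 'SCRIPT'; cursor += m.length;
        } else if ((m = at(tag, cursor)) !== null) {
            cursor += m.length;                       // all other tags are dropped
        } else if ((m = at(lineBreak, cursor)) !== null) {
            out += "\n"; cursor += m.length;          // whitespace with a newline becomes one "\n"
        } else if ((m = at(numeric, cursor)) !== null) {
            const code = parseInt(m.replace(/[^0-9]/g, ""), 10);
            out += code <= 0x10FFFF ? String.fromCodePoint(code) : m;
            cursor += m.length;
        } else if ((m = at(named, cursor)) !== null) {
            out += miniEntities.get(m.replace(/^&|;$/g, "")) ?? m;
            cursor += m.length;
        } else {
            out += input[cursor];                     // plain text is copied
            cursor++;
        }
    }
    return out;
}

const miniResult = extractText(
    "<html><head><title>T</title></head><body><p>W&uuml;rttemberg &#8555;</p>\n" +
    "  <script>var x = 1;</script>done</body></html>");
console.log(JSON.stringify(miniResult));
```

Here the cursor is always advanced by the length of the matched text, so no tag lengths have to be hard-coded.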

## Running the Scanner

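Finally, we run the scanner on the HTML data. The loop calls `scanHtml` 100 times in order to measure the total running time. Since `scanHtml` prints its result, the extracted text is printed 100 times and the measured time includes the time needed for the console output.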

In [None]:
console.time("Scan");
for (let i : number = 0; i < 100; i++)
    scanHtml(data);
console.timeEnd("Scan");
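
If only the computation is to be measured, the work can be timed without any console output inside the loop. The following is a minimal sketch; `averageMillis` is a hypothetical helper, and the workload shown is a stand-in for whatever function is under test.

```typescript
// Hypothetical helper: average wall-clock time in milliseconds of `runs` calls to `work`.
function averageMillis(work: () => void, runs: number): number {
    const start = Date.now();
    for (let i = 0; i < runs; i++) {
        work();
    }
    return (Date.now() - start) / runs;
}

// Stand-in workload: strip the tags from a short string and record the length
// of the result, so that the work cannot be optimized away.
let sink = 0;
const avg = averageMillis(() => {
    sink += "<p>x</p>".replace(/<[^>]+>/g, "").length;
}, 100);
console.log(`average: ${avg} ms per run`);
```

`Date.now()` has a resolution of one millisecond, so for very short workloads the number of runs has to be large enough to give a meaningful average.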