In [1]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Regular Expressions in TypeScript (A Short Tutorial)

This tutorial explores how regular expressions are written and used in TypeScript.
It is assumed that the reader is already familiar with the fundamental concepts of [regular expressions](https://en.wikipedia.org/wiki/Regular_expression), typically covered in formal language courses such as [Formal Languages and Their Application](https://github.com/karlstroetmann/Formal-Languages/blob/master/Lecture-Notes/formal-languages.pdf). The focus here is to bridge the gap between the theoretical understanding of regular expressions and their practical application within the TypeScript programming environment.

In TypeScript, regular expressions are integrated into the core language as first-class citizens, accessible through regex literals and the `RegExp` object.
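As a minimal sketch of these two forms, the following snippet builds the same pattern once as a regex literal and once through the `RegExp` constructor. The variable names are illustrative only. Note that the constructor takes the pattern as an ordinary string, so the backslash has to be doubled.

```typescript
// Two ways to create the same regular expression (names are illustrative).
const fromLiteral: RegExp = /\d+/g;
const fromConstructor: RegExp = new RegExp("\\d+", "g");

// Both objects carry the same pattern source and the same flags.
console.log(fromLiteral.source === fromConstructor.source); // true
console.log(fromLiteral.flags === fromConstructor.flags);   // true

// Both find the same matches in the same input.
const literalHits = "a1b22c333".match(fromLiteral);
const constructorHits = "a1b22c333".match(fromConstructor);
console.log(literalHits, constructorHits);
```

The literal form is usually preferable when the pattern is fixed; the constructor form is useful when the pattern is assembled at run time.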

## Regular Expressions as Formal Languages

Regular expressions serve as textual patterns that define <em style="color:blue">languages</em>. In this context, a <em style="color:blue">language</em> is understood as a specific <em style="color:blue">set of strings</em>. For the ensuing discussion, let $\Sigma$ be the universal set of all Unicode characters, and $\Sigma^*$ the set comprising all strings formed from these Unicode characters. We will inductively define the set $\textrm{RegExp}$ of regular expressions.

To elucidate the semantics of a given regular expression $r$, we introduce a function

$$\mathcal{L}: \textrm{RegExp} \rightarrow 2^{\Sigma^*}$$

where $\mathcal{L}(r)$ denotes the <em style="color:blue">language</em> represented by the regular expression $r$.

## The `match()` Method

To illustrate the functionality of regular expressions, we will employ the `match()` method from TypeScript's `String` class. The basic syntax is:

```"inputText".match(/pattern/flags)```

The components are:
- `/pattern/` - A regular expression literal defining the search pattern
- `"inputText"` - The target string in which we search for matches
- `flags` - Optional modifiers that influence matching behavior

Common flags include:
- `g` - **global search**: find all matches instead of just the first
- `i` - **ignore case**: perform case-insensitive matching  
- `m` - **multiline mode**: anchors `^` and `$` match line boundaries

When used with the `g` flag, `match()` returns an array of all non-overlapping substrings that match the regular expression, or `null` if there is no match. Without the `g` flag, it returns only the first match.

**Example:** The regular expression `/a/g` searches for all lowercase `a` characters:

In [2]:
"abcabcABC".match(/a/g);

[ 'a', 'a' ]


This returns `['a', 'a']`, matching only the two lowercase letters.

In the next example, the flags `gi` are combined for global, case-insensitive matching:

In [3]:
"abcabcABC".match(/a/gi);

[ 'a', 'a', 'A' ]


This returns `['a', 'a', 'A']`, matching all three occurrences regardless of case.
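The following sketch illustrates two details of `match()` that the examples above do not show: the result without the `g` flag, and the result when nothing matches. The variable names are illustrative only.

```typescript
// Without the `g` flag, match() reports only the first match,
// together with the position where it was found.
const first = "abcabcABC".match(/c/);
if (first !== null) {
  console.log(first[0]);    // the matched text: "c"
  console.log(first.index); // the position of the match: 2
}

// When nothing matches, match() returns null rather than an empty array.
const none = "abcabcABC".match(/x/g);
console.log(none); // null

// A common idiom: fall back to an empty array so that array methods are safe.
const safe = "abcabcABC".match(/x/g) ?? [];
console.log(safe.length); // 0
```

The `?? []` fallback is the same idiom used in the later cells of this notebook.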

## Meta-Characters

We begin our definition of the set $\textrm{RegExp}$ with the set $\textrm{MetaChars}$, the collection of all meta-characters used in regular expressions:
```
MetaChars := { '.', '^', '$', '*', '+', '?', '{', '}', '[', ']', '\', '|', '(', ')' }
```
These characters have special syntactic meanings in regular expressions:
- `.` matches any character except newlines
- `^` and `$` mark the start and end of a string (or line with the `m` flag)
- `*`, `+`, and `?` define quantifiers
- `{}` specify repetition ranges
- `[]` define character classes
- `|` is the alternation (OR) operator
- `()` create groupings for subpatterns or capture groups
- `\` escapes special characters

## Basic Regular Expressions

Now we can start our inductive definition of regular expressions:

**1. Literal Characters:**
- Any Unicode character $c$ such that $c \not\in \textrm{MetaChars}$ is a regular expression matching that character:
$$\mathcal{L}(c) = \{ c \}$$

**2. Escaped Meta-Characters:**
- If $c$ is a meta-character (i.e., $c \in \textrm{MetaChars}$), then $\backslash c$ is a regular expression matching the literal character $c$:
$$\mathcal{L}(\backslash c) = \{ c \}$$

**Example:** To match a literal `+` symbol in the string `"1+1=2"`:

In [4]:
"1+1=2".match(/\+/g);

[ '+' ]


This returns `['+']`, matching the plus sign which would otherwise be interpreted as a quantifier.
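When the text to search for is only known at run time, escaping every meta-character by hand is not possible. A minimal sketch of one way to automate it is shown below; `escapeRegExp` is a hypothetical helper written for this notebook, not a built-in function, and it relies on the standard `String.prototype.replace` method.

```typescript
// Hypothetical helper: prefix every meta-character with a backslash so that
// the resulting pattern matches the input text literally.
function escapeRegExp(text: string): string {
  // "$&" in the replacement string stands for the matched character itself.
  return text.replace(/[.^$*+?{}[\]\\|()]/g, "\\$&");
}

// Build a pattern that searches for the literal text "1+1=2".
const literalPattern = new RegExp(escapeRegExp("1+1=2"), "g");
console.log(literalPattern.source); // 1\+1=2

const found = "1+1=2 and again 1+1=2".match(literalPattern) ?? [];
console.log(found); // [ '1+1=2', '1+1=2' ]
```

Without the escaping step, the `+` in `"1+1=2"` would be read as a quantifier and the pattern would match `"11=2"` instead.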

## Concatenation

The next rule shows how regular expressions can be <em style="color:blue">concatenated</em>:
- If $r_1$ and $r_2$ are regular expressions, then $r_1r_2$ is a regular expression. This
  regular expression matches any string $s$ that can be split into two substrings $s_1$ and $s_2$ 
  such that $r_1$ matches $s_1$ and $r_2$ matches $s_2$. Formally, we have
  $$\mathcal{L}(r_1r_2) := 
    \bigl\{ s_1 \cdot s_2 \mid s_1 \in \mathcal{L}(r_1) \wedge s_2 \in \mathcal{L}(r_2) \bigr\}.
  $$
  
In formal language theory, the notation $r_1 \cdot r_2$ is often used, but in TypeScript we simply write $r_1r_2$ by placing the expressions side-by-side.

Using concatenation of regular expressions, we can now find words:

In [5]:
"The horse, the dog, and the cat.".match(/the/gi);

[ 'The', 'the', 'the' ]


This returns `['The', 'the', 'the']`, matching all occurrences of "the" regardless of case.

## Choice

Regular expressions provide the operator `|` that can be used to choose between 
<em style="color:blue">alternatives</em>:
- If $r_1$ and $r_2$ are regular expressions, then $r_1|r_2$ is a regular expression. This
  regular expression matches any string $s$ that is matched by either $r_1$ or $r_2$.
  Formally, we have
  $$\mathcal{L}(r_1|r_2) := \mathcal{L}(r_1) \cup \mathcal{L}(r_2). $$
  
In formal language theory, the notation $r_1 + r_2$ is often used to denote choice, but TypeScript uses the pipe symbol `|` (alternation operator).

In [6]:
"The horse, the dog, and a cat.".match(/the|a/gi);

[ 'The', 'the', 'a', 'a', 'a' ]


This returns `['The', 'the', 'a', 'a', 'a']`, matching either "the" or any single `a` character (case-insensitive). The five matches are: "The", "the", the `a` in "and", the standalone "a", and the `a` in "cat".

## Quantifiers

The most interesting regular expression operators are the <em style="color:blue">quantifiers</em>.
A quantifier specifies how often the preceding regular expression may be repeated. Syntactically, quantifiers are 
<em style="color:blue">postfix operators</em>.

### The Plus Quantifier (`+`)

- If $r$ is a regular expression, then $r+$ is a regular expression. This
  regular expression matches any string $s$ that can be split into a list of $n$ substrings $s_1$, 
  $s_2$, $\cdots$, $s_n$ such that $r$ matches $s_i$ for all $i \in \{1,\cdots,n\}$. 
  Formally, we have
  $$\mathcal{L}(r+) := 
    \Bigl\{ s \Bigm| \exists n \in \mathbb{N}: \bigl(n \geq 1 \wedge 
            \exists s_1,\cdots,s_n : (s_1 \cdots s_n = s \wedge 
             \forall i \in \{1,\cdots, n\}: s_i \in \mathcal{L}(r))\bigr)  
    \Bigr\}.
  $$

Informally, $r+$ matches $r$ **one or more times** (any positive number of times).

In [7]:
"abaabaAaba".match(/a+/gi);

[ 'a', 'aa', 'aAa', 'a' ]


This returns `['a', 'aa', 'aAa', 'a']`. Because the `+` quantifier is **greedy**, it matches as many consecutive `a` characters as possible (case-insensitive due to the `i` flag). For example, the sequence "aAa" is matched as a single group rather than three separate matches.

### The Star Quantifier (`*`)

- If $r$ is a regular expression, then $r*$ is a regular expression. This
  regular expression matches either the empty string or any string $s$ that can be split into a list of $n$ substrings $s_1$, 
  $s_2$, $\cdots$, $s_n$ such that $r$ matches $s_i$ for all $i \in \{1,\cdots,n\}$. 
  Formally, we have
  $$\mathcal{L}(r*) := \bigl\{ \texttt{''} \bigr\} \cup
    \Bigl\{ s \Bigm| \exists n \in \mathbb{N}: \bigl(n \geq 1 \wedge 
            \exists s_1,\cdots,s_n : (s_1 \cdots s_n = s \wedge 
             \forall i \in \{1,\cdots, n\}: s_i \in \mathcal{L}(r))\bigr)  
    \Bigr\}.
  $$
  
Informally, $r*$ matches $r$ **zero or more times**. Therefore, in the following example the result also contains empty strings. For instance, in the string `"abaabbaaaba"`, the regular expression `/a*/g` finds an empty string at the position of each character `b`, because zero `a` characters can be matched there. The final empty string is found at the end of the input string:

In [8]:
"abaabbaaaba".match(/a*/g);

[
  'a', '',  'aa',
  '',  '',  'aaa',
  '',  'a', ''
]


This returns `['a', '', 'aa', '', '', 'aaa', '', 'a', '']`, including empty strings where zero `a` characters are matched.

### The Question Mark Quantifier (`?`)

- If $r$ is a regular expression, then $r?$ is a regular expression. This
  regular expression matches either the empty string or any string $s$ that is matched by $r$. Formally we have
  $$\mathcal{L}(r?) := \bigl\{ \texttt{''} \bigr\} \cup \mathcal{L}(r). $$
  
Informally, $r?$ matches $r$ **zero or one time** (i.e., at most once). Therefore, in the following example the result contains two empty strings: one before the character `b`, and one at the end of the string.

In [9]:
"abaa".match(/a?/g);

[ 'a', '', 'a', 'a', '' ]


This returns `['a', '', 'a', 'a', '']`.

### Range Quantifiers (`{m,n}`)

- If $r$ is a regular expression and $m,n\in\mathbb{N}$ such that $m \leq n$, then $r\{m,n\}$ is a 
  regular expression. This regular expression matches any number $k$ of repetitions of $r$ such that $m \leq k \leq n$.
  Formally, we have
  $$\mathcal{L}(r\{m,n\}) =
    \Bigl\{ s \mid \exists k \in \mathbb{N}: \bigl(m \leq k \leq n \wedge 
            \exists s_1,\cdots,s_k : (s_1 \cdots s_k = s \wedge 
             \forall i \in \{1,\cdots, k\}: s_i \in \mathcal{L}(r))\bigr)  
    \Bigr\}.
  $$
  
Informally, $r\{m,n\}$ matches $r$ **at least $m$ times and at most $n$ times**.

In [10]:
"aaaa".match(/a{2,3}/g);

[ 'aaa' ]


This returns `['aaa']`. The regular expression `/a{2,3}/g` greedily matches the string `"aaaa"` by consuming the first three consecutive `a` characters (the maximum allowed). The remaining single `a` does **not** match because it falls short of the minimum requirement of 2 characters.

### Exact Count Quantifier (`{n}`)

If $r$ is a regular expression and $n\in\mathbb{N}$, then $r\{n\}$ is a regular expression. This regular expression matches **exactly $n$ repetitions** of $r$. Formally, we have
$$\mathcal{L}(r\{n\}) = \mathcal{L}(r\{n,n\}).$$

In [11]:
"aabaaaba".match(/a{2}/g);

[ 'aa', 'aa' ]


This returns `['aa', 'aa']`, matching exactly two consecutive `a` characters each time.

### Up-To Quantifier (`{0,n}`)

If $r$ is a regular expression and $n\in\mathbb{N}$, then $r\{0,n\}$ is a regular expression that matches **up to $n$ repetitions** of $r$ (i.e., between 0 and $n$ times). This is simply the range quantifier $r\{m,n\}$ with $m = 0$. Unlike some other regex dialects (for example Python's `re` module), TypeScript has no shorthand $r\{,n\}$ for this, so the lower bound $0$ must be written explicitly.

In [12]:
"aabaaabba".match(/a{0,2}/g);

[
  'aa', '', 'aa',
  'a',  '', '',
  'a',  ''
]


This returns `['aa', '', 'aa', 'a', '', '', 'a', '']`, matching zero, one, or two consecutive `a` characters. Empty strings appear where zero `a` characters are matched.

### At-Least Quantifier (`{n,}`)

If $r$ is a regular expression and $n\in\mathbb{N}$, then $r\{n,\}$ is a regular expression. This regular expression matches **$n$ or more repetitions** of $r$. Formally, we have
$$\mathcal{L}(r\{n,\}) = \mathcal{L}(r\{n\}r*).$$

In [13]:
"aabaaaba".match(/a{2,}/g);

[ 'aa', 'aaa' ]


This returns `['aa', 'aaa']`, matching sequences of two or more consecutive `a` characters.

---

**💡 Syntax Note**

TypeScript/JavaScript regex supports:
- `{n,}` — matches *n* or more times ✓
- `{0,n}` — matches up to *n* times ✓

**Not supported:**
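- `{,n}` — this is **not** a quantifier in TypeScript: without the `u` flag the braces are matched as literal text, and with the `u` flag the pattern is a syntax error ✗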

Always write `{0,n}` explicitly when you need "up to *n* times" matching.

---
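The following sketch shows what happens when `{,n}` is written anyway. It assumes an engine that follows the ECMAScript grammar, such as Node.js or a current browser. The patterns are built with the `RegExp` constructor so that they are parsed at run time; the variable names are illustrative only.

```typescript
const upToTwo = new RegExp("^a{0,2}$"); // a valid quantifier
const braces = new RegExp("^a{,2}$");   // not a quantifier: the braces are literal text

console.log(upToTwo.test("aa"));   // true
console.log(braces.test("aa"));    // false
console.log(braces.test("a{,2}")); // true, the braces are matched literally

// With the `u` flag, the malformed quantifier is rejected outright.
let rejected = false;
try {
  new RegExp("a{,2}", "u");
} catch (e) {
  rejected = e instanceof SyntaxError;
}
console.log(rejected); // true
```

Because the mistake fails silently without the `u` flag, it is easy to overlook.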

## Non-Greedy Quantifiers

The quantifiers `?`, `+`, `*`, `{m,n}`, `{n}`, `{0,n}`, and `{n,}` are <em style="color:blue">greedy</em> by default—they 
match the **longest possible substring**. Suffixing any of these quantifiers with the character `?` makes them 
<em style="color:blue">non-greedy</em> (also called **lazy** or **reluctant**), causing them to match the **shortest possible substring**.

For example, the regular expression `/a{2,3}?/` can match either two or three occurrences of the character `a`, but will **prefer to match only two**. Hence, the regular expression `/a{2,3}?/g` will find two matches in the string `"aaaa"`, while the greedy version `/a{2,3}/g` only finds a single match.

**Non-greedy example:**

In [14]:
"aaaa".match(/a{2,3}?/g);

[ 'aa', 'aa' ]


This returns `['aa', 'aa']` because the non-greedy quantifier stops at the minimum (2 characters) each time.

**Greedy example (for comparison):**

In [15]:
"aaaa".match(/a{2,3}/g);

[ 'aaa' ]


This returns `['aaa']` because the greedy quantifier consumes the maximum possible (3 characters) in the first match, leaving only one `a` which **does not meet the minimum of 2** and therefore is not matched.

## Character Classes

In order to match a set of characters, we can use a <em style="color:blue">character class</em>.
If $c_1$, $\cdots$, $c_n$ are Unicode characters, then $[c_1\cdots c_n]$ is a regular expression that 
matches any of the characters from the set $\{c_1,\cdots,c_n\}$:
$$ \mathcal{L}\bigl([c_1\cdots c_n]\bigr) := \{ c_1, \cdots, c_n \} $$

In [16]:
"abcdcba".match(/[abc]+/g);

[ 'abc', 'cba' ]


This returns `['abc', 'cba']`, matching sequences of the characters `a`, `b`, or `c`.

### Character Ranges

Character classes can also contain <em style="color:blue">ranges</em>. Syntactically, a range has the form $c_1\texttt{-}c_2$, where $c_1$ and $c_2$ are Unicode characters.

For example, the regular expression `/[0-9]/` contains the range `0-9` and matches any decimal digit. To find all natural numbers embedded in a string, we could use the regular expression `/[1-9][0-9]*|0/g`. This regular expression matches either:
- A single digit `0`, **or**
- A string that starts with a non-zero digit (`[1-9]`) followed by any number of additional digits (`[0-9]*`)

In [17]:
"11 abc 12 2345 007 42 0".match(/[1-9][0-9]*|0/g)

[
  '11',   '12',
  '2345', '0',
  '0',    '7',
  '42',   '0'
]


This returns `['11', '12', '2345', '0', '0', '7', '42', '0']`. Note that `"007"` is split into `'0'`, `'0'`, and `'7'`: the first alternative `[1-9][0-9]*` cannot start at a zero, so each leading `0` is matched on its own by the second alternative, and the remaining `7` is then matched by the first.

**Important: Order matters in alternation!**

The next example looks similar but gives a very different result:

In [18]:
"11 abc 12 2345 007 42 0".match(/[0-9]|[1-9][0-9]*/g)

[
  '1', '1', '1', '2',
  '2', '3', '4', '5',
  '0', '0', '7', '4',
  '2', '0'
]


This returns `['1', '1', '1', '2', '2', '3', '4', '5', '0', '0', '7', '4', '2', '0']`, a list of individual digits!

Here's why: The regular expression starts with the alternative `[0-9]`, which matches any single digit. As soon as one digit is found, the match is returned and the search continues from the end of that match. The second alternative (`[1-9][0-9]*`) never gets a chance to match because the first alternative always succeeds first.

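**Key lesson:** When using alternation (`|`), the regex engine takes the first alternative that succeeds, not the longest one. Therefore, place more specific (longer) patterns before more general ones.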
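The same effect can be seen with plain words. The sketch below uses an illustrative input string in which one alternative is a prefix of the other; the variable names are not part of the original notebook.

```typescript
const sentence = "Java and JavaScript";

// Shorter alternative first: "Java" wins even where "JavaScript" would fit.
const shortFirst = sentence.match(/Java|JavaScript/g) ?? [];
console.log(shortFirst); // [ 'Java', 'Java' ]

// Longer alternative first: the full word is found where it occurs.
const longFirst = sentence.match(/JavaScript|Java/g) ?? [];
console.log(longFirst); // [ 'Java', 'JavaScript' ]
```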

### Predefined Character Classes

TypeScript provides several predefined character classes as escape sequences:

- `\d` — matches any digit (equivalent to `[0-9]`)
- `\D` — matches any non-digit character
- `\s` — matches any whitespace character (spaces, tabs, newlines, etc.)
- `\S` — matches any non-whitespace character
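- `\w` — matches any ASCII letter, digit, or underscore (equivalent to `[0-9a-zA-Z_]`)
- `\W` — matches any character that is not matched by `\w`
- `\b` — matches at a **word boundary** (the position between a word character and a non-word character, or between a word character and the start or end of the string). Strictly speaking, this is an assertion rather than a character class: the matched string is empty, so it is a zero-width assertion.
- `\B` — matches at any position that is **not** a word boundary. Again, the matched string is empty.

The character class escapes `\d`, `\D`, `\s`, `\S`, `\w`, and `\W` can also be used inside square brackets. The assertions `\b` and `\B` cannot: inside square brackets, `\b` denotes the backspace character.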

In [19]:
"11 abc12 1a2 2b3c4d5".match(/[\dabcde]+/g)

[ '11', 'abc12', '1a2', '2b3c4d5' ]


This returns `['11', 'abc12', '1a2', '2b3c4d5']`, matching sequences containing digits or the letters `a`, `b`, `c`, `d`, `e`.
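To see several of the predefined classes side by side, the following sketch applies them to one illustrative input string. It also shows a detail that is easy to miss: inside square brackets, `\b` does not mean "word boundary" but stands for the backspace character.

```typescript
// Illustrative input containing letters, digits, punctuation, a space, and a tab.
const sample = "Room 42b,\tfloor_3!";

console.log(sample.match(/\d+/g)); // [ '42', '3' ]
console.log(sample.match(/\w+/g)); // [ 'Room', '42b', 'floor_3' ]
console.log(sample.match(/\s+/g)); // [ ' ', '\t' ]
console.log(sample.match(/\W+/g)); // [ ' ', ',\t', '!' ]

// Inside a character class, \b is the backspace character (U+0008).
console.log(/[\b]/.test("\b"));  // true
console.log(/[\b]/.test("a b")); // false
```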

### Negated Character Classes

Character classes can be **negated** by placing the caret symbol `^` immediately after the opening bracket `[`. For example, `[^abc]` matches any character that is **not** `a`, `b`, or `c`.

In [20]:
"axyzbuvwchij".match(/[^abc]+/g);

[ 'xyz', 'uvw', 'hij' ]


This returns `['xyz', 'uvw', 'hij']`, matching sequences that don't contain `a`, `b`, or `c`.

### Word Boundaries in Practice

The `\b` assertion is particularly useful for extracting complete words:

In [21]:
"This is some text where we want to extract the words.".match(/\b\w+\b/g)

[
  'This',    'is',
  'some',    'text',
  'where',   'we',
  'want',    'to',
  'extract', 'the',
  'words'
]


This returns `['This', 'is', 'some', 'text', 'where', 'we', 'want', 'to', 'extract', 'the', 'words']`, capturing all complete words.

### Using Word Boundaries to Match Numbers

The following regular expression uses `\b` to isolate complete numbers. Note that we must use parentheses because **concatenation binds more tightly than alternation** (`|`):

In [22]:
"11 abc 12 2345 007 42 0".match(/\b([1-9][0-9]*|0)\b/g)

[ '11', '12', '2345', '42', '0' ]


This returns `['11', '12', '2345', '42', '0']`.

**Important observation:** The number `"007"` is **not matched** because the pattern `([1-9][0-9]*|0)` only accepts:
- Numbers starting with a non-zero digit (`[1-9][0-9]*`), **or**
- A single zero (`0`)

The sequence `"007"` doesn't fit either category—it starts with zero but contains more than one digit. The word boundary `\b` recognizes `"007"` as a complete numeric token, but the inner pattern rejects it as invalid.

Contrast this with the alternative where order matters:

In [23]:
"11 abc 12 2345 007 42 0".match(/\b([0-9]|[1-9][0-9]*)\b/g)

[ '11', '12', '2345', '42', '0' ]


This also returns `['11', '12', '2345', '42', '0']` - the same result!

Why? Because the word boundaries `\b` force the regex to match **complete numeric tokens**. Even though the first alternative `[0-9]` matches a single digit, the word boundary on the right side (`\b`) prevents it from matching just one digit when more digits follow. So after matching the first digit, the regex engine backtracks and tries the second alternative `[1-9][0-9]*`, which successfully matches the entire number.

**Key takeaway:** Word boundaries change the matching behavior significantly. In this case, both patterns produce the same result because `\b` enforces complete token matching, causing the regex engine to backtrack and find the longest match. However, without word boundaries (as shown in earlier examples), the order of alternatives would matter greatly.

## Grouping

If $r$ is a regular expression, then $(r)$ is a regular expression describing the same language as $r$. There are two main reasons for using parentheses for grouping:

**1. To Override Operator Precedence:**
- Parentheses can be used to override the default precedence of operators. This concept is the same as in programming languages. For example, the regular expression `/ab+/` matches the character `a` followed by one or more `b`'s because the quantifier `+` has higher precedence than concatenation. 
- However, `/(ab)+/` matches sequences like `ab`, `abab`, `ababab`, and so on, because the `+` now applies to the entire group `(ab)`.

**2. To Create Capturing Groups and Backreferences:**
- Parentheses create <em style="color:blue">capturing groups</em>, which "remember" the substring they matched.
- Inside the same regular expression, you can refer to the text captured by a group using a <em style="color:blue">backreference</em>: `\n`, where $n$ is the group's number ($n \in \{1,\cdots,9\}$).
- Groups are numbered starting from 1 based on the position of their opening parenthesis, from left to right.

For example, the regular expression `/(a(b|c)*d)?ef(gh)+/` has three groups:
1. `(a(b|c)*d)` is the first group (the outermost one).
2. `(b|c)` is the second group (nested within the first).
3. `(gh)` is the third group.
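The numbering can be checked directly. When `match()` is called without the `g` flag, the result array holds the whole match at index 0 and the text captured by group $n$ at index $n$. The sketch below applies the pattern from the list above to two illustrative input strings.

```typescript
const groupsDemo = "abcdefghgh".match(/(a(b|c)*d)?ef(gh)+/);
if (groupsDemo !== null) {
  console.log(groupsDemo[0]); // "abcdefghgh", the whole match
  console.log(groupsDemo[1]); // "abcd", group 1
  console.log(groupsDemo[2]); // "c", group 2 keeps its last repetition
  console.log(groupsDemo[3]); // "gh", group 3 keeps its last repetition
}

// If an optional group does not take part in the match, its entry is undefined.
const optionalSkipped = "efgh".match(/(a(b|c)*d)?ef(gh)+/);
console.log(optionalSkipped?.[1]); // undefined
console.log(optionalSkipped?.[3]); // "gh"
```

Note that a group under a quantifier remembers only the text of its final repetition.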

A common use for backreferences is to find repeated patterns. For instance, to recognize a string that starts with a number, followed by whitespace, followed by the **same** number, we can use the regular expression `/(\d+)\s+\1/g`. Here, `\1` is a backreference to whatever was captured by the first group `(\d+)`.

In [24]:
"12 12 23 23 17 18".match(/(\d+)\s+\1/g);

[ '12 12', '23 23' ]


This returns `['12 12', '23 23']`, matching each instance where a number is repeated after some whitespace.

In general, given a digit $n$, the expression `\n` inside the regex pattern refers to the string matched by the $n$-th capturing group of the regular expression.

## The Dot

The regular expression `.` matches any character **except the newline**. For example, `/c.*t/` matches any string that starts with the character `c` and ends with the character `t`, with any characters (except newlines) in between. 

Using the **greedy** quantifier `*`, the regex will match the longest possible substring:

In [25]:
"ct cat caat could we look at that!".match(/c.*t/g)

[ 'ct cat caat could we look at that' ]


This returns `['ct cat caat could we look at that']`, matching from the first `c` to the last `t` in a single greedy match.

If we use the **non-greedy** version `*?`, we can find multiple, shorter matches:

In [26]:
"ct cat caat could we look at that!".match(/c.*?t/g);

[ 'ct', 'cat', 'caat', 'could we look at' ]


This returns `['ct', 'cat', 'caat', 'could we look at']`, matching from each `c` to the nearest `t`.

**Note:** The dot `.` does not have any special meaning when used inside a character class. Hence, the regular expression `/[.]/` matches only the literal period character `.`.
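A short sketch of both properties of the dot, using illustrative input strings: outside a character class it matches any character except a newline, while inside a character class it matches only a literal period.

```typescript
// Outside a character class, the dot matches every character of "a.b.c".
console.log("a.b.c".match(/./g)); // [ 'a', '.', 'b', '.', 'c' ]

// Inside a character class, the dot matches only the literal period.
console.log("a.b.c".match(/[.]/g)); // [ '.', '.' ]

// The dot never crosses a newline, so each line is matched separately.
console.log("line1\nline2".match(/.+/g)); // [ 'line1', 'line2' ]
```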

## Named Groups

Referencing a group via the syntax `\n` where $n$ is a natural number is both cumbersome and error-prone, especially in complex patterns. Instead, we can use <em style="color:blue">named groups</em> for better readability and maintainability.

### Syntax for Named Groups in TypeScript

The syntax to define a named group is: 
```
(?<name>r)
```
where `name` is the name of the group and `r` is the regular expression.

To refer to the string matched by this group **within the same pattern**, we use:
```
\k<name>
```

### Example: Matching Quoted Strings

Below we find strings of alphanumeric characters that are enclosed in either single quotes or double quotes. The character class `['"]` matches either a single or a double quote. By using a named group `quote` and a backreference `\k<quote>`, we ensure that an opening single quote is matched by a closing single quote, and an opening double quote is matched by a closing double quote.

In [27]:
`abc "uvw" and 'xyz'`.match(/(?<quote>['"])\w*\k<quote>/g);

[ '"uvw"', "'xyz'" ]


This returns `['"uvw"', "'xyz'"]`, correctly matching quoted strings with matching quote types.
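Named groups are also available to the surrounding code, not only inside the pattern. As a sketch, assuming a runtime and compiler target that support named groups (ES2018 or later): when `match()` is called without the `g` flag, the result has a `groups` property that maps each group name to the captured text. The date string and variable names below are illustrative only.

```typescript
const dateMatch = "Released on 2024-03-15.".match(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
);

// `groups` is undefined when the pattern has no named groups,
// and `dateMatch` is null when nothing matched, hence the optional chaining.
const year = dateMatch?.groups?.year;
const month = dateMatch?.groups?.month;
const day = dateMatch?.groups?.day;

console.log(year, month, day); // 2024 03 15
```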

## Start and End of a Line

The regular expression `^` matches the **start of a string**. Without any flags, it only matches at the very beginning of the entire input.

The regular expression `$` matches the **end of a string**. Without any flags, it only matches at the very end of the entire input.

When the `m` flag (**multiline mode**) is enabled, the behavior changes:
- `^` also matches immediately after every newline character (i.e., at the start of each line)
- `$` also matches right before every newline character (i.e., at the end of each line)

This is useful for processing multi-line text line-by-line:

In [28]:
`This is a text containing five lines, two of which are empty.
This is the second non-empty line,

and this is the third non-empty line.
`.match(/^.+$/gm)

[
  'This is a text containing five lines, two of which are empty.',
  'This is the second non-empty line,',
  'and this is the third non-empty line.'
]


This returns `['This is a text containing five lines, two of which are empty.', 'This is the second non-empty line,', 'and this is the third non-empty line.']`, matching all non-empty lines. The pattern `/^.+$/` matches lines that start (`^`), contain one or more characters (`.+`), and end (`$`). Empty lines don't match because `.+` requires at least one character.
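To make the effect of the `m` flag explicit, the following sketch runs the same anchored patterns with and without it on a small illustrative three-line string.

```typescript
const threeLines = "first\nsecond\nthird";

// Without `m`, ^ matches only at the start of the whole string ...
console.log(threeLines.match(/^\w+/g));  // [ 'first' ]
// ... and $ matches only at its very end.
console.log(threeLines.match(/\w+$/g));  // [ 'third' ]

// With `m`, both anchors also match at every line boundary.
console.log(threeLines.match(/^\w+/gm)); // [ 'first', 'second', 'third' ]
console.log(threeLines.match(/\w+$/gm)); // [ 'first', 'second', 'third' ]
```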

## Lookahead Assertions

Sometimes we need to check what comes **after** a pattern without including it in the match. This is called a <em style="color:blue">lookahead assertion</em>.

### Positive Lookahead (`(?=...)`)

The syntax for a positive lookahead is:
$$ r_1 (\texttt{?=}r_2) $$

Here, $r_1$ and $r_2$ are regular expressions, and `?=` is the <em style="color:blue">lookahead operator</em>. This matches $r_1$ **only if** it is followed by $r_2$, but $r_2$ itself is not included in the match.

**Example:** Find all numbers that are followed by a dollar sign (`$`):

In [29]:
const text = "Here is 1$, here are 21€, and there are 42 $.";
const numbers = text.match(/[0-9]+(?=\s*\$)/g) ?? []; // fall back to [] because match() may return null
console.log(numbers);

[ '1', '42' ]


This outputs `['1', '42']`, matching only the numbers followed by a dollar sign (with optional whitespace in between). The number `21` is not matched because it's followed by `€`, not `$`.

If you wanted to sum these numbers, you could do:

In [30]:
const sum = numbers.map(Number).reduce((a, b) => a + b, 0);
console.log("Sum:", sum); // Sum: 43

Sum: 43


### Negative Lookahead (`(?!...)`)

The syntax for a negative lookahead is:
$$ r_1 (\texttt{?!}r_2) $$

Here, `?!` is the <em style="color:blue">negative lookahead operator</em>. This matches $r_1$ **only if** it is **not** followed by $r_2$.

**Example:** Find all numbers that are **not** followed by a dollar sign:

In [31]:
const text = "Here is 1$, here are 21 €, and there are 42 $.";
const numbers = text.match(/[0-9]+(?![0-9]*\s*\$)/g) ?? [];
console.log(numbers);

[ '21' ]


This outputs `["21"]`, matching only the number `21` because it is followed by `€`, not `$`.

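**Why does this work?** The pattern `/[0-9]+(?![0-9]*\s*\$)/` matches one or more digits, but only if they are **not** followed by optional additional digits (`[0-9]*`), optional whitespace (`\s*`), and then a dollar sign (`\$`). This ensures that numbers like `1` in `1$` and `42` in `42 $` are excluded, leaving only `21`, which is followed by `€`. The `[0-9]*` inside the lookahead is essential: without it, the engine could backtrack and match just `4` from `42`, since `4` is followed by `2` rather than by a dollar sign.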

**Important note:** Negative lookahead can be tricky and error-prone, especially when dealing with complex patterns. In many cases, it's clearer to simply filter results after matching rather than building complex negative lookahead expressions.
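A minimal sketch of this filtering approach is shown below. It finds every number with a simple pattern and then inspects the text that follows each match. It uses the standard `RegExp.prototype.exec` method, which returns one match at a time together with its position; the variable names are illustrative only.

```typescript
const priceText = "Here is 1$, here are 21 €, and there are 42 $.";
const notDollar: string[] = [];

const digitRun = /[0-9]+/g;
let hit: RegExpExecArray | null;
// With the `g` flag, each call to exec() continues after the previous match.
while ((hit = digitRun.exec(priceText)) !== null) {
  // The text that follows the current number.
  const rest = priceText.slice(hit.index + hit[0].length);
  // Keep the number only if it is not followed by optional whitespace and "$".
  if (!/^\s*\$/.test(rest)) {
    notDollar.push(hit[0]);
  }
}
console.log(notDollar); // [ '21' ]
```

Since `[0-9]+` always consumes a complete run of digits here, there is no backtracking subtlety to reason about.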

## Examples

In order to have some strings to play with, let us read the file `alice.txt`, which contains the book
[Alice's Adventures in Wonderland](https://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland) written by 
[Lewis Carroll](https://en.wikipedia.org/wiki/Lewis_Carroll).

In [32]:
const text = readFileSync("alice.txt", "utf8");

Let's take a look at the beginning of the book:

In [33]:
console.log(text.slice(0, 1020));


                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 3.0




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'

  So she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.

  There was nothing so VERY remarkable in that; nor did Alice
think it so VERY much out of the way to hear the Rabbit say to
itself, `Oh dear!

### How many non-empty lines does this story have?

To count non-empty lines, we use the multiline flag (`m`) and look for lines that contain at least one non-whitespace character (`\S`):

In [34]:
(text.match(/^.*\S.*$/gm) ?? []).length

2725


This returns the total number of lines containing text (ignoring blank lines).
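As a cross-check, the same count can be computed by splitting the text into lines and testing each line separately. The sketch below compares both approaches on a small illustrative string instead of the full book; the variable names are not part of the original notebook.

```typescript
// Five lines: two with text, one empty, one with spaces only, one empty at the end.
const sampleText = "first line\n\n   \nsecond line\n";

// Approach 1: one multiline regular expression over the whole text.
const byRegex = (sampleText.match(/^.*\S.*$/gm) ?? []).length;

// Approach 2: split into lines, then keep lines with a non-whitespace character.
const bySplit = sampleText.split("\n").filter((line) => /\S/.test(line)).length;

console.log(byRegex, bySplit); // 2 2
```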

### Checking for "inappropriate" four-letter words

Next, let us check whether this text is suitable for minors. In order to do so, we search for all **four-letter words** that start with either `d`, `f`, or `s` and end with `k` or `t`:

In [35]:
const matches = text.match(/\b[dfs]\w{2}[kt]\b/gi) ?? [];
const uniqueMatches = new Set(matches);
console.log([...uniqueMatches]);

[
  'feet', 'dark', 'sort',
  'felt', 'shut', 'fact',
  'FOOT', 'foot', 'salt',
  'soft', 'Duck', 'suit',
  'suet', 'fast', 'desk',
  'flat', 'sink', 'duck',
  'fork', 'sent', 'spot'
]


The pattern `/\b[dfs]\w{2}[kt]\b/gi` breaks down as:
- `\b` — word boundary
- `[dfs]` — starts with d, f, or s
- `\w{2}` — exactly two word characters
- `[kt]` — ends with k or t
- `\b` — word boundary
- `gi` flags — global, case-insensitive

We use a `Set` to automatically remove duplicate words. The spread operator `[...uniqueMatches]` converts the Set back to an array for display.

### Word count statistics

How many words are in this text and how many different words are used?

In [36]:
const words = text.toLowerCase().match(/\b\w+\b/g) ?? [];
const uniqueWords = new Set(words);
console.log(
  `There are ${words.length} words in this book and ${uniqueWords.size} different words.`
);

There are 27344 words in this book and 2579 different words.


This first extracts all words (converted to lowercase for consistency), then uses a `Set` to count the number of unique words. The output shows both the total word count and the vocabulary size of the book.