# Regular expressions

<div class="alert alert-block alert-info">
    You can find all of the scripts in this notebook in the subdirectory containing this notebook:
    <code>./scripts/regex</code>
</div>

A regular expression, or regex, is a sequence of characters that specifies a search pattern. Regular expressions
are different from Bash's built-in pattern matching, although, confusingly, the two use some of the same symbols
to specify patterns.

The idea comes from theoretical computer science and formal language theory.
A regular expression might be appropriate when you need to search for text that matches a pattern.
Bash uses the POSIX extended regular expression syntax.
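To make the difference between Bash's built-in pattern matching and regular expressions concrete, here is a minimal sketch that compares the `==` operator (pattern matching) with the `=~` operator (regular expressions) on the same symbol, `*`. The function names `matches_glob` and `matches_regex` are illustrative only and are not part of the notebook's scripts.

```sh
#!/bin/bash

# matches_glob STRING PATTERN: built-in pattern matching.
# The pattern must match the whole string.
matches_glob() {
    [[ $1 == $2 ]]
}

# matches_regex STRING REGEX: regular expression matching.
# The regex may match any part of the string.
matches_regex() {
    [[ $1 =~ $2 ]]
}

# As a pattern, * matches any sequence of characters.
matches_glob abc 'a*' && echo "pattern a* matches abc"
matches_glob xyz 'a*' || echo "pattern a* does not match xyz"

# As a regex, * repeats the preceding element zero or more times,
# so a* also matches zero occurrences of a, which succeeds on any string.
matches_regex xyz 'a*' && echo "regex a* matches xyz"
```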


### The `=~` operator

The `=~` operator performs regular expression pattern matching inside of `[[ ]]`. The following:

---
```sh
if [[ $str =~ $re ]]; then
    echo "$str matches $re"
else
    echo "$str does not match $re"
fi

```
---

tests whether any part of the string `$str` matches the pattern defined by the regular expression `$re`.
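Because `[[ ]]` is a command, the result of the test is also available as an exit status. The sketch below prints that status for three cases; the helper name `regex_status` is hypothetical, and the status of 2 for a malformed regular expression is the behavior described in the Bash manual.

```sh
#!/bin/bash

# regex_status STRING REGEX: print the exit status of [[ STRING =~ REGEX ]].
regex_status() {
    [[ $1 =~ $2 ]]
    echo $?
}

regex_status sandman and    # 0: the regex matches part of the string
regex_status hangry and     # 1: the regex does not match
regex_status sandman '('    # 2: the regex is syntactically incorrect
```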

<div class="alert alert-block alert-warning">
    Be careful when quoting the regular expression. Quoting the variable expansion forces the entire
    pattern to be matched as a literal string instead of as a regular expression.<br /><br />
    Quoting only part of the regular expression forces just the quoted part to be matched as a literal string.
</div>
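The effect of quoting can be seen in a short sketch. It assumes the default behavior of Bash 3.2 and later, where a quoted right-hand side of `=~` is matched literally; the function names `unquoted_match` and `quoted_match` are illustrative only.

```sh
#!/bin/bash

re='^h.llo$'

# Unquoted: $re is treated as a regular expression.
unquoted_match() {
    [[ $1 =~ $re ]]
}

# Quoted: "$re" is treated as literal text to search for.
quoted_match() {
    [[ $1 =~ "$re" ]]
}

unquoted_match hello && echo "unquoted: hello matches the regex"
quoted_match hello || echo "quoted: hello does not contain the literal text"
quoted_match 'x^h.llo$x' && echo "quoted: x^h.llo\$x contains the literal text"
```

Storing the regular expression in a variable and expanding it without quotes, as `test_regex.sh` does, avoids the problem.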


The following script is useful for experimenting with regular expressions:

---
```sh
#!/bin/bash

# test_regex.sh

if (( $# != 2 )); then
    echo "Usage: test_regex.sh string regex" >&2
    exit 1
fi
str=$1
re=$2
if [[ $str =~ $re ]]; then
    echo "$str matches $re"
else
    echo "$str does not match $re"
fi

```
---

### POSIX extended regular expression syntax

A regular expression is simply a string where each character is either a regular character or a metacharacter
having special meaning.

Regular characters are matched literally. For example, the regular expression `a` defines the pattern `a`.
`[[ $str =~ a ]]` is true if the string `str` contains at least one `a`:

    $ ./scripts/regex/test_regex.sh hello a
    hello does not match a

    $ ./scripts/regex/test_regex.sh walrus a
    walrus matches a


A string matches the regular expression `and` if the string contains the sequence of characters `and`:

    $ ./scripts/regex/test_regex.sh hangry and
    hangry does not match and

    $ ./scripts/regex/test_regex.sh sandman and
    sandman matches and


Metacharacters are symbols that have special meaning in a regular expression. The following table summarizes
the meaning of the metacharacters used in the POSIX extended regular expression syntax:

| Metacharacter | Description |
| :--- | :--- |
| <code>^</code>     | An anchor. Matches the beginning of the string when used as the first character of an expression.|
| <code>&dollar;</code>     | An anchor. Matches the end of the string when used as the last character of an expression. |
| <code>.</code>      | Matches any single character. |
| <code>[ ]</code>    | Bracket expression. Matches any single character that is inside the brackets. |
| <code>[^ ]</code>   | Matches any single character that is not inside the brackets. |
| <code>()</code>     | A subexpression. |
| <code>*</code>      | Matches the preceding element zero or more times. |
| <code>?</code>      | Matches the preceding element zero or one time. |
| <code>+</code>      | Matches the preceding element one or more times. |
| <code>{m,n}</code>  | Matches the preceding element at least `m` and not more than `n` times. |
| <code>\|</code>      | The choice operator. Matches the expression before or after the operator. |

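Most of the metacharacters shown above lose their special meaning inside square brackets; thus, a metacharacter
can be matched literally by placing it inside square brackets. The exceptions are `^` as the first character
(which negates the set), `-` between two characters (which forms a range), and `]` (which closes the bracket
expression unless it comes first).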
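As a small sketch of matching metacharacters literally with square brackets, the functions below look for a literal `*` and a literal `.`. The function names are hypothetical and chosen only for this example.

```sh
#!/bin/bash

# has_literal_star STRING: succeed if STRING contains a literal asterisk.
has_literal_star() {
    local re='[*]'
    [[ $1 =~ $re ]]
}

# is_a_dot_c STRING: succeed only if STRING is exactly a.c.
is_a_dot_c() {
    local re='^a[.]c$'
    [[ $1 =~ $re ]]
}

has_literal_star '2*3' && echo "2*3 contains a literal *"
has_literal_star '23'  || echo "23 does not contain a literal *"
is_a_dot_c 'a.c' && echo "a.c matches"
is_a_dot_c 'abc' || echo "abc does not match, because [.] is not a wildcard"
```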

The following table shows basic examples of regular expressions and strings that match:

| Regex | Matches |
| :--- | :--- |
| <code>hello</code> | any string containing the substring `hello` |
| <code>^hello&dollar;</code> | `hello` |
| <code>^.ello&dollar;</code> | `1ello`, `Aello`, `jello`, and many more |
| <code>^h..lo&dollar;</code> | `h11lo`, `hAylo`, `h@Mlo`, and many more |
| <code>^s[aei]t&dollar;</code> | `sat`, `set`, `sit` |
| <code>^s[a-z]t&dollar;</code> | `sat`, `sbt`, `sct`, ..., `szt` |
| <code>^s[-a-z]t&dollar;</code> | `s-t`, `sat`, `sbt`, ..., `szt` |
| <code>^file[0-9]&dollar;</code> | `file0`, `file1`, ..., `file9` |
| <code>^a[.]c&dollar;</code> | `a.c` |

A single regular character (not a metacharacter), a subexpression, or a bracket expression is called an *atom*.
The quantifiers `*`, `?`, `+`, and `{m,n}` specify how many times an atom must match.

The following table shows examples of using the quantifier metacharacters to control the number of
characters to match:

| Regex | Matches |
| :--- | :--- |
| <code>^.*&dollar;</code> | all strings including the empty string |
| <code>^.+&dollar;</code> | all non-empty strings |
| <code>^.?&dollar;</code> | the empty string and all strings of length one |
| <code>[0-9]</code> | any string containing at least one digit |
| <code>[0-9]+</code> | any string containing at least one digit |
| <code>^[0-9]+&dollar;</code> | any unsigned integer (may overflow) |
| <code>^-[0-9]+&dollar;</code> | any negative integer |
| <code>^[-+]?[0-9]+&dollar;</code> | any signed or unsigned integer |
| <code>^[-+]?[[:digit:]]+&dollar;</code> | any signed or unsigned integer |
| <code>^xa{1}&dollar;</code> | `xa` |
| <code>^xa{1,}&dollar;</code> | `x` followed by one or more `a`s |
| <code>^xa{1,3}&dollar;</code> | `xa`, `xaa`, or `xaaa` |
| <code>^(abc)+&dollar;</code> | `abc`, `abcabc`, `abcabcabc`, ... |
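One common use of the quantifiers is validating input. The following is a minimal sketch built from two of the regular expressions in the table; the function names `is_integer` and `is_xa_1_to_3` are hypothetical.

```sh
#!/bin/bash

# is_integer STRING: succeed if STRING is an optional sign followed by
# one or more digits.
is_integer() {
    local re='^[-+]?[0-9]+$'
    [[ $1 =~ $re ]]
}

# is_xa_1_to_3 STRING: succeed if STRING is x followed by one to three a's.
is_xa_1_to_3() {
    local re='^xa{1,3}$'
    [[ $1 =~ $re ]]
}

for s in 42 -7 +15 3.14 '' 12a; do
    if is_integer "$s"; then
        echo "'$s' is an integer"
    else
        echo "'$s' is not an integer"
    fi
done

is_xa_1_to_3 xaa   && echo "xaa matches"
is_xa_1_to_3 xaaaa || echo "xaaaa does not match"
```

The regular expression only checks the form of the string; as the table notes, a string of digits that passes this test may still overflow when used in arithmetic.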

The metacharacter `|` matches the regular expression before or after the `|`.
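The sketch below illustrates the choice operator and shows why it is usually combined with a subexpression: without parentheses, `|` splits the entire expression, anchors included. The function names are illustrative only.

```sh
#!/bin/bash

# is_yes_or_no STRING: succeed only if STRING is exactly yes or no.
is_yes_or_no() {
    local re='^(yes|no)$'
    [[ $1 =~ $re ]]
}

# starts_yes_or_ends_no STRING: without parentheses the alternatives are
# ^yes and no$, so this means "starts with yes" or "ends with no".
starts_yes_or_ends_no() {
    local re='^yes|no$'
    [[ $1 =~ $re ]]
}

is_yes_or_no yes   && echo "yes matches ^(yes|no)\$"
is_yes_or_no piano || echo "piano does not match ^(yes|no)\$"
starts_yes_or_ends_no piano && echo "piano matches ^yes|no\$"
```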