# Regular expressions

<div class="alert alert-block alert-info">
    You can find all of the scripts in this notebook in the subdirectory containing this notebook:
    <code>./scripts/regex</code>
</div>

> Some people, when confronted with a problem, think “I know,
> I'll use regular expressions.”  Now they have two problems.
> -Jamie Zawinski (see http://regex.info/blog/2006-09-15/247)

A regular expression, or regex, is a sequence of characters that specifies a search pattern. Regular expressions
are different from Bash's built-in pattern matching, although, confusingly, the two use some of the same
symbols for specifying patterns.

The idea comes from theoretical computer science and formal language theory.
A regular expression might be appropriate when you need to search for text that matches a pattern.
Bash uses the POSIX extended regular expression syntax.

This notebook does not cover all features of regular expressions (otherwise it would essentially be a book
of its own). Interested readers may refer to the end of this notebook for additional resources.

### The `=~` operator

The `=~` operator performs regular expression pattern matching inside of `[[ ]]`. The following:

---
```sh
if [[ $str =~ $re ]]; then
    echo "$str matches $re"
else
    echo "$str does not match $re"
fi

```
---

tests if any part of the string `$str` contains the pattern defined by the regular expression `$re`.

<div class="alert alert-block alert-warning">
    Be careful when quoting the regular expression. Quoting the variable expansion forces the entire
    pattern to be matched as a string instead of performing regular expression matching.<br /><br />
    Quoting part of the regular expression forces the quoted part to be matched as a string.
</div>
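
The effect of quoting can be checked directly. The following is a minimal sketch, assuming Bash 3.2 or later (where a quoted right-hand side of `=~` is matched literally); the values `abc` and `a.c` are illustrative and are not part of the notebook's scripts.

```sh
#!/bin/bash

# Illustrative values (assumptions, not from the notebook's scripts).
str='abc'
re='a.c'

# Unquoted: $re is used as a regex, so "." matches the "b" in "abc".
if [[ $str =~ $re ]]; then unquoted="match"; else unquoted="no match"; fi

# Quoted: "$re" is matched as the literal string a.c, which "abc" does not contain.
if [[ $str =~ "$re" ]]; then quoted="match"; else quoted="no match"; fi

# Partly quoted: only the quoted "." is literal; the rest is still a regex.
if [[ 'a.c' =~ ^a"."c$ ]]; then partial="match"; else partial="no match"; fi

echo "unquoted: $unquoted"
echo "quoted: $quoted"
echo "partly quoted: $partial"
```

Storing the regex in a variable and expanding it unquoted, as `test_regex.sh` does, avoids most quoting surprises.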


The following script is useful for experimenting with regular expressions:

---
```sh
#!/bin/bash

# test_regex.sh

if (( $# != 2 )); then
    echo "Usage: test_regex.sh string regex" >&2
    exit 1
fi
str=$1
re=$2
if [[ $str =~ $re ]]; then
    echo "The string $str is matched by the regex $re"
else
    echo "The string $str is not matched by the regex $re"
fi

```
---

### POSIX extended regular expressions syntax

A regular expression is simply a string where each character is either a regular character or a metacharacter
having special meaning.

Regular characters are matched literally. For example, the regular expression `a` defines the pattern `a`.
`[[ $str =~ a ]]` is true if the string `str` contains at least one `a`:

In [None]:
./scripts/regex/test_regex.sh hello a

In [None]:
./scripts/regex/test_regex.sh walrus a

A string matches the regular expression `and` if the string contains the sequence of characters `and`:

In [None]:
./scripts/regex/test_regex.sh hangry and

In [None]:
./scripts/regex/test_regex.sh sandman and

Metacharacters are symbols that have special meaning in a regular expression. The following table summarizes
the meaning of the metacharacters used in the POSIX extended regular expression syntax:

| Metacharacter | Description |
| :--- | :--- |
| <code>^</code>     | An anchor. Matches the beginning of the line when used as the first character of an expression.|
| <code>\$</code>     | An anchor. Matches the end of the line when used as the last character of an expression. |
| <code>.</code>      | Matches any single character. |
| <code>[ ]</code>    | Bracket expression. Matches any single character that is inside the brackets. |
| <code>[^ ]</code>   | Matches any single character that is not inside the brackets. |
| <code>()</code>     | A subexpression. |
| <code>*</code>      | Matches the preceding element zero or more times. |
| <code>?</code>      | Matches the preceding element zero or one time. |
| <code>+</code>      | Matches the preceding element one or more times. |
| <code>{m,n}</code>  | Matches the preceding element at least m and not more than n times. |
| <code>\|</code>      | The choice operator. Matches the expression before or after the operator. |

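Most of the metacharacters shown above lose their special meaning inside square brackets; thus, a metacharacter
can be matched literally by placing it inside square brackets. The exceptions are `^` as the first character
(which negates the bracket expression), `-` between two characters (which denotes a range), and `]` (which
closes the bracket expression).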
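
As a small illustration of matching a metacharacter literally, the sketch below places `.` inside square brackets. The test strings are illustrative assumptions.

```sh
#!/bin/bash

# Illustrative regex: a literal "." between "a" and "c".
re='^a[.]c$'

for s in 'a.c' 'abc'; do
    if [[ $s =~ $re ]]; then
        echo "$s matches $re"
    else
        echo "$s does not match $re"
    fi
done
```

Without the brackets, `^a.c$` would match both strings, because an unbracketed `.` matches any single character.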

The following table shows basic examples of regular expressions and strings that match:

| Regex | Matches |
| :--- | :--- |
| <code>hello</code> | any string containing the substring `hello` |
| <code>^hello\$</code> | `hello` |
| <code>^.ello\$</code> | `1ello`, `Aello`, `jello`, and many more |
| <code>^h..lo\$</code> | `h11lo`, `hAylo`, `h@Mlo`, and many more |
| <code>^s[aei]t\$</code> | `sat`, `set`, `sit` |
| <code>^s[a-z]t\$</code> | `sat`, `sbt`, `sct`, ..., `szt` |
| <code>^s[-a-z]t\$</code> | `s-t`, `sat`, `sbt`, ..., `szt` |
| <code>^file[0-9]\$</code> | `file0`, `file1`, ..., `file9` |
| <code>^a[.]c\$</code> | `a.c` |

In [None]:
./scripts/regex/test_regex.sh "Say hello" hello

A single regular character, the metacharacter `.`, a subexpression, or a bracket expression is called an *atom*.
The quantifiers `*`, `?`, `+`, and `{m,n}` specify how many times the preceding atom must match.
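
Because a quantifier applies only to the atom immediately before it, parentheses change what gets repeated. The following sketch contrasts the two cases; the regexes and test strings are illustrative assumptions.

```sh
#!/bin/bash

# Illustrative regexes (assumptions, not from the notebook's scripts).
re_char='^ab+$'     # "+" applies to the single character "b"
re_group='^(ab)+$'  # "+" applies to the subexpression "(ab)"

for s in abbb abab; do
    if [[ $s =~ $re_char ]]; then echo "$s matches $re_char"; fi
    if [[ $s =~ $re_group ]]; then echo "$s matches $re_group"; fi
done
```

Here `abbb` matches only `^ab+$`, and `abab` matches only `^(ab)+$`.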

The following table shows examples of using the quantifier metacharacters to control the number of
characters to match:

| Regex | Matches |
| :--- | :--- |
| <code>^.*\$</code> | all strings including the empty string |
| <code>^.+\$</code> | all non-empty strings |
| <code>^.?\$</code> | the empty string and all strings of length one |
| <code>[0-9]</code> | any string containing at least one digit |
| <code>[0-9]+</code> | any string containing at least one digit |
| <code>^[0-9]+\$</code> | any unsigned integer (may overflow) |
| <code>^-[0-9]+\$</code> | any negative integer |
| <code>^[-+]?[0-9]+\$</code> | any signed or unsigned integer |
| <code>^[-+]?[[:digit:]]+\$</code> | any signed or unsigned integer |
| <code>^xa{1}\$</code> | `xa` |
| <code>^xa{1,}\$</code> | `x` followed by one or more `a`s |
| <code>^xa{1,3}\$</code> | `xa`, `xaa`, or `xaaa` |
| <code>^(abc)+\$</code> | `abc`, `abcabc`, `abcabcabc`, ... |

In [None]:
./scripts/regex/test_regex.sh "" "^.*$"
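
The integer patterns from the table can be wrapped in a small validation function. This is a sketch only: `is_integer` is a hypothetical helper name, and the candidate values are illustrative.

```sh
#!/bin/bash

# is_integer is a hypothetical helper name, not part of the notebook's scripts.
# It succeeds if its argument is an optionally signed run of digits.
is_integer() {
    local re='^[-+]?[0-9]+$'
    [[ $1 =~ $re ]]
}

for candidate in 42 -7 +3 4.2 '' 12abc; do
    if is_integer "$candidate"; then
        echo "'$candidate' is an integer"
    else
        echo "'$candidate' is not an integer"
    fi
done
```

The anchors matter here: without `^` and `$`, a string such as `12abc` would match because it contains a run of digits.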

Two regular expressions may be joined using the metacharacter `|`. The resulting regular expression
matches any string that matches either of the joined expressions.


| Regex | Regex before `\|` | Regex after `\|` | Matches |
| :--- | :--- | :--- | :--- |
| <code>a&vert;b</code> | <code>a</code> | <code>b</code> | any string containing an `a` or a `b` |
| <code>^b&vert;cat\$</code> | <code>^b</code> | <code>cat\$</code> | any string starting with `b` or ending with `cat` |
| <code>^b&vert;(^c)at\$</code> | <code>^b</code> | <code>(^c)at\$</code> | any string starting with `b` or the string `cat` |
| <code>(^b&vert;^c)at\$</code> | <code>^b</code> | <code>^c</code> | `bat` or `cat` |

In [None]:
./scripts/regex/test_regex.sh java "a|b"
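
The difference between an ungrouped and a grouped choice can be seen by running the same strings through two of the regexes from the table. This is a minimal sketch; the test strings are illustrative assumptions.

```sh
#!/bin/bash

# Both regexes come from the table above; the test strings are illustrative.
re_loose='^b|cat$'       # starts with b, OR ends with cat
re_strict='(^b|^c)at$'   # exactly bat or cat

for s in bat cat bird concat rat; do
    if [[ $s =~ $re_loose ]]; then loose=yes; else loose=no; fi
    if [[ $s =~ $re_strict ]]; then strict=yes; else strict=no; fi
    echo "$s: $re_loose -> $loose, $re_strict -> $strict"
done
```

Only `bat` and `cat` match the grouped regex, while `bird` and `concat` also match the ungrouped one, because `|` splits the whole expression, anchors included.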

What do the following regular expressions match?

* <code>^(19|20)[0-9][0-9]\$</code>
* <code>^(0[1-9]|1[0-2])\$</code>
* <code>^(0[1-9]|[12][0-9]|3[01])\$</code>
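
One way to check a guess is to run several candidate strings through each regex. The sketch below assumes a hypothetical helper named `check`; the candidate strings are illustrative and can be replaced with your own.

```sh
#!/bin/bash

# check is a hypothetical helper name: it reports whether each remaining
# argument matches the regex given as the first argument.
check() {
    local re=$1 s
    shift
    for s in "$@"; do
        if [[ $s =~ $re ]]; then
            echo "$s matches $re"
        else
            echo "$s does not match $re"
        fi
    done
}

check '^(19|20)[0-9][0-9]$' 1899 1999 2024 2100
check '^(0[1-9]|1[0-2])$' 00 01 12 13
check '^(0[1-9]|[12][0-9]|3[01])$' 00 09 29 31 32
```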

### The `grep` program

`grep` (shorthand for "global regular expression print") is one of many programs that use regular expressions.
`grep` searches one or more files, or standard input, for lines that match a regular expression.
Use the `-E` option to select POSIX extended regexes; otherwise,
`grep` uses POSIX basic regexes. The basic usage is:

```sh
grep -E [options] regex [file...]
```

If no files are given, then `grep` reads from standard input.
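
Reading from standard input means `grep` can filter the output of another command. A minimal sketch, using an illustrative three-word input that is not part of the notebook's scripts:

```sh
#!/bin/bash

# Illustrative input (an assumption, not from the notebook's scripts):
# three words, one per line, piped to grep's standard input.
matches=$(printf '%s\n' cat bat rat | grep -E '^(b|c)at$')
echo "$matches"
```

This prints `cat` and `bat`. Single-quoting the regex keeps the shell from interpreting `(`, `|`, and `)`. Without `-E`, those three characters would be treated as literal characters under the basic regex syntax.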

Some examples of using the `grep` program are shown in the following cells.

#### List files in `/usr/bin` that contain the string `zip` in their name

In [None]:
echo "files that contain zip"
ls /usr/bin | grep -E zip

echo "files that start with zip"
ls /usr/bin | grep -E '^zip'

echo "files that end with zip"
ls /usr/bin | grep -E 'zip$'


#### Print all lines of a Java source code file that start a `for` loop

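The `-n` option prints the line number of each matching line. A first attempt searches for the string `for`,
which also matches lines that contain `for` anywhere, such as inside another word or a comment: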

In [None]:
grep -En for ./scripts/regex/AllWordsLookup.java

A slightly better approach is to search for the string `for` followed by zero or more blanks (spaces or tabs)
and then the opening `(`. Because `(` is a metacharacter in a regex, it must be escaped or placed inside `[]`:

In [None]:
grep -En "for[[:blank:]]*[(]" ./scripts/regex/AllWordsLookup.java
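
The same regex can be tried without the Java file by piping a few lines into `grep`. The lines below are illustrative assumptions and are not taken from `AllWordsLookup.java`.

```sh
#!/bin/bash

# Illustrative Java-like lines (assumptions, not taken from AllWordsLookup.java).
sample='for (int i = 0; i < n; i++) {
// wait for input
for(String w : words) {
String format = "x";'

# Only lines where "for" is followed by optional blanks and "(" are printed.
hits=$(printf '%s\n' "$sample" | grep -En 'for[[:blank:]]*[(]')
echo "$hits"
```

Lines 1 and 3 are printed with their line numbers. Lines 2 and 4 contain `for` but are skipped because no `(` follows it.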

#### Print all lines of an HTML file that contain an anchor tag

In [None]:
grep -En "<a " mywebpage.html

#### Help solving a crossword puzzle clue

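Installing the `spell` program typically provides a dictionary file named `/usr/share/dict/words` that
contains one word per line. `grep` can search this file for words that match a pattern, such as a five-letter
word whose second, third, and fifth letters are `a`, `k`, and `s`. The `-i` option ignores case.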

In [None]:
grep -Ei "^.ak.s$" /usr/share/dict/words
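
If the dictionary file is not installed, the same search can be tried on a short inline word list. This is a sketch only, and the words are illustrative assumptions.

```sh
#!/bin/bash

# Illustrative word list (an assumption, standing in for /usr/share/dict/words).
found=$(printf '%s\n' bakes LAKES makers oaks takes | grep -Ei '^.ak.s$')
echo "$found"
```

This prints `bakes`, `LAKES`, and `takes`. `LAKES` matches only because of `-i`, and the anchors reject `makers` and `oaks`, which are not exactly five letters long.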

# More information

> Regular expressions are like a particularly spicy hot sauce – to be used in moderation and with restraint only when appropriate. -Jeff Atwood (https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/)

Regular expressions are a powerful tool but can be challenging to read and write.
Readers seeking more information on regular expressions may find the following resources useful:

* https://www.regular-expressions.info/
* http://regex.info/book.html
* https://www.rexegg.com/
