# Regular Expressions in JavaScript
Herman Leung / 21 July 2017

## Table of Contents
1. [Basics](#1)
2. [Functions](#2)
3. [Flags](#3)
4. [Character classes](#4)
5. [Ranges](#5)
6. [Groups](#6)
7. [Quantifiers](#7)
8. [Anchors](#8)
9. [Assertions](#9)
10. [References and tools](#10)

### 1. Basics -- characters, strings, and escaping with backslash<a name='1'></a>


Backslash for quotes and apostrophes in string notation

In [None]:
var text = "She said, \"I'm going to eat six spoons of fresh snow peas.\" ";
var text2 = 'She said, "I\'m going to eat six spoons of fresh snow peas." ';

text === text2; // true

In regex patterns, functional chars need to be escaped with a backslash if you want the literal character (some only in certain positions, e.g. - inside [ ]). These include: . + ? * - [ ] ( ) { } \ / ^ $ |

In [None]:
// cannot be parsed (SyntaxError): the unescaped ( opens a group that is never closed
'I (Caesar) came, I saw, I conquered'.search(/(/); 

In [None]:
// need to escape the open parenthesis
'I (Caesar) came, I saw, I conquered'.search(/\(/); // index of match returned

The backslash changes nothing for chars that have no special function, however (both searches below return 0)

In [None]:
'='.search(/=/);

In [None]:
'='.search(/\=/);
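
When a pattern is built from arbitrary text at run time, the functional chars have to be escaped by code rather than by hand. A minimal sketch, assuming a hypothetical helper named `escapeRegExp` (not part of these notes) and the built-in `RegExp` constructor:

```javascript
// Hypothetical helper: put a backslash in front of every functional char
// so that the text is matched literally.
function escapeRegExp(text) {
    return text.replace(/[.+?*\-[\](){}\\\/^$|]/g, '\\$&'); // $& = the matched char
}

var escapedPattern = escapeRegExp('(Caesar)'); // the string \(Caesar\)
var literalRegex = new RegExp(escapedPattern);  // same as /\(Caesar\)/

console.log(escapedPattern);
console.log('I (Caesar) came, I saw, I conquered'.search(literalRegex)); // 2
```

The `$&` in the replacement string stands for the entire match; it is covered in the section on groups.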

Some special chars (line breaks, tabs, etc.) are written as escape sequences

In [None]:
console.log('a\na'); // \n - new line / line feed (LF)
console.log('=====');
console.log('C\rD'); // \r - carriage return (CR)
console.log('=====');
console.log('1\t2'); // \t - tab

In [None]:
// Hexadecimal and unicode chars
console.log('\xe3');   // \x + 2 chars /[0-9a-f]/i
console.log('\u8003'); // \u + 4 chars /[0-9a-f]/i

In [None]:
// Astrals
console.log('\u{1F4A9}'); // ES6 \u + up to 6 chars inside curly brackets
console.log('\uD83D\uDCA9'); // ES5 equivalent
// NOTE 1: Astrals are stored as 2 UTF-16 code units (a "surrogate pair"), so their .length is 2
// NOTE 2: Other chars that look like 1 char may be composed of several code points (e.g. a letter + a combining accent)
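
The two notes above can be checked directly. A small sketch using only built-in string methods; the variable names are made up for illustration:

```javascript
var astral = '\u{1F4A9}';

console.log(astral.length);                      // 2 (two UTF-16 code units)
console.log(astral === '\uD83D\uDCA9');          // true
console.log(Array.from(astral).length);          // 1 (Array.from iterates by code point)
console.log(astral.codePointAt(0).toString(16)); // '1f4a9'

// A composed char: 'e' followed by a combining acute accent (U+0301)
var composed = 'e\u0301';
console.log(composed.length);                    // 2, although it displays as 1 char
```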

**Multi-line strings**

*>>> Using backticks (ES6)*

In [None]:
// In ES6, the backticks `` allow you to spread a string across multiple lines, 
// and all the white space is included in the string

console.log(`Three blind mice
    see how they run!`);

*>>> Using \ at the end of the line*

In [None]:
// If you use single or double quotes, a backslash at the end of a line lets you
// continue the string on the next line. The line break itself is dropped, but the
// leading spaces on the next line stay in the string.

var str = 'Three blind mice \
           see how they run!';
console.log(str);

### 2. Functions<a name='2'></a>

| FUNCTION | USAGE | RETURNS (NO MATCH) | RETURNS (MATCH) |
| :--- | :--- | :--- | :--- |
| **.match()** | `str`**.match**(`regex`) | null | array: [first match, index of match, input string] |
| | with flag 'g' | null | array of all the matches only (no indexes or input string) |
| **.exec()** | `regex`**.exec**(`string`) | null | array: [match, index of match, input string] |
| **.search()** | `str`**.search**(`regex`) | -1 | index of the first match |
| **.replace()** | `str`**.replace**(`regex`, `replacement`) | new string, identical to `str` | new string with replacement(s) |
| **.test()** | `regex`**.test**(`string`) | false | true |
#### .match()

In [None]:
'abcda'.match(/a/); // one match returns extra info on the match

In [None]:
'abcda'.match(/a/g); // more than one match returns just the matches

In [None]:
'abcda'.match(/z/); // no match returns null

#### .exec()

In [None]:
// without flag g, functions just like .match()
REGEX = /a/; 
REGEX.exec('abcda');

In [None]:
REGEX.exec('wxyzw');

In [None]:
// with flag g, each call stops immediately after a match and the next call resumes from there,
//      so it can be called as many times as there are matches, until it returns null at the end of the string
REGEX = /a/g;
REGEX.exec('abcda'); 

In [None]:
REGEX.exec('abcda');

In [None]:
REGEX.exec('abcda');
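
The position where the next call resumes is stored on the regex itself, in its `lastIndex` property. A short sketch of the three calls above written as a loop (variable names are illustrative):

```javascript
var globalRegex = /a/g;
var foundAt = [];
var result;

while ((result = globalRegex.exec('abcda')) !== null) {
    // lastIndex is where the next call to .exec() will start searching
    foundAt.push({ index: result.index, lastIndex: globalRegex.lastIndex });
}

console.log(foundAt);               // matches at index 0 and 4
console.log(globalRegex.lastIndex); // 0 -- reset after the call that returned null
```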

#### .search()

In [None]:
'abcda'.search(/d/); // returns index of first match

In [None]:
'abcda'.search(/z/); // returns -1 if no match

#### .replace()

In [None]:
'abcda'.replace(/a/, 'Z');
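
Without the g flag, only the first match is replaced. A quick comparison (variable names are illustrative):

```javascript
var replacedFirst = 'abcda'.replace(/a/, 'Z');  // first match only
var replacedAll = 'abcda'.replace(/a/g, 'Z');   // all matches
var replacedNone = 'abcda'.replace(/z/g, 'Z');  // no match, string unchanged

console.log(replacedFirst); // 'Zbcda'
console.log(replacedAll);   // 'ZbcdZ'
console.log(replacedNone);  // 'abcda'
```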

#### .test()

In [None]:
/a/.test('abcda');

In [None]:
/z/.test('abcda');

### 3. Flags <a name='3'></a>

| FLAG | MEANING | FUNCTION |
| :--- | :--- | :--- |
| g | global | used with `.match()`, `.exec()`, and `.replace()`, finds all matches (not just the first) |
| i | ignore case | does not distinguish lower and upper case |
| m | multiline | tells `^` and `$` to treat `\n` or `\r` as boundaries (more later) |
| u | unicode | treats the pattern and the string as sequences of Unicode code points, so an astral char counts as 1 char and `\u{...}` works inside the pattern |

#### g -- global

In [None]:
'ABCDA'.match(/A/g);

#### i -- ignore case

In [None]:
'ABCDA'.search(/b/i);

#### m -- multiline

In [None]:
'Dogs are funny.\nDogs are hairy.'.match(/^D/gm);

In [None]:
'Dogs are funny.\nDogs are hairy.'.match(/\.$/gm);
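
The u flag is easiest to see with an astral char. A minimal sketch (variable names are illustrative):

```javascript
var pileOfPoo = '\u{1F4A9}'; // astral char, stored as 2 code units

var dotWithoutU = /^.$/.test(pileOfPoo);    // the dot sees 2 separate chars
var dotWithU = /^.$/u.test(pileOfPoo);      // the dot matches the whole code point
var escapeWithU = /\u{1F4A9}/u.test(pileOfPoo); // \u{...} in a pattern needs the u flag

console.log(dotWithoutU); // false
console.log(dotWithU);    // true
console.log(escapeWithU); // true
```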

### 4. Character classes <a name='4'></a>

| CHAR | DEFINITION | CHAR | DEFINITION |
| :--- | :--- | :--- | :--- |
| \d | numerical [0-9] | \D  | non-numerical [^0-9] |
| \w | word [A-Za-z0-9_]| \W | non-word [^A-Za-z0-9_] |
| \s | white space | \S | not white space |

In [None]:
'123'.search(/\d/);

In [None]:
'123'.search(/\D/);

In [None]:
'abc'.match(/\w/g);

In [None]:
'abc'.match(/\W/g);

In [None]:
'\t\r\n'.replace(/\s/g, 'Y');

In [None]:
/\S/.test('\t\r\n');
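
Note that `\w` is limited to the ASCII range shown in the table, so accented letters count as non-word chars, while `\s` covers more than tab, CR, and LF. A short sketch (variable names are illustrative):

```javascript
var accented = 'caf\u00e9'; // 'café'

var wordChars = accented.match(/\w/g);
var nonWordChars = accented.match(/\W/g);
var nbspIsSpace = /\s/.test('\u00a0'); // no-break space

console.log(wordChars);    // [ 'c', 'a', 'f' ]
console.log(nonWordChars); // [ 'é' ]
console.log(nbspIsSpace);  // true
```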

### 5. Ranges <a name='5'></a>
NOTE: `[ ]`, `[^ ]`, and `.` each match exactly 1 character; `|` chooses between alternative patterns

`[]` -- range specification

`[^ ]` -- range specification, excluding

`.` -- everything except `\n` (and the other line terminators: `\r`, `\u2028`, `\u2029`)

`|` -- or

#### [ ] and [^ ]

In [None]:
'I ain\'t got no money, yo!'.match(/[',!]/g);

In [None]:
'I ain\'t got no money, yo!'.match(/[^\w]/g);

In [None]:
'I ain\'t got no money, yo!'.replace(/[A-Za-z]/gi, '_');

#### . (dot --> everything except \n)

In [None]:
'.,>@5n'.match(/./g); // dot matches everything, except...

In [None]:
/./.test('\n'); // newline

In [None]:
// So if you really want to include everything, use .|\n
'Line\nLine'.match(/.|\n/g);
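
The dot also skips `\r`, so `.|\n` still misses a char in Windows-style line endings. A common alternative, shown here as a sketch, is the range `[\s\S]` (white space or not white space, i.e. anything):

```javascript
var dotMatchesCR = /./.test('\r');
var dotOrNewline = 'a\r\nb'.match(/.|\n/g);
var anyChar = 'a\r\nb'.match(/[\s\S]/g);

console.log(dotMatchesCR); // false
console.log(dotOrNewline); // [ 'a', '\n', 'b' ] -- the \r is skipped
console.log(anyChar);      // [ 'a', '\r', '\n', 'b' ]
```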

#### | (or)

In [None]:
// | -- or
/da|bf/.test('daf');

In [None]:
/da|bf/.test('dzf');

### 6. Groups <a name='6'></a>

| REGEX | DESCRIPTION |
| :--- | :--- |
| ( ) | Capturing group |
| (?: ) | Non-capturing group |
| \1 | Group reference within regex pattern |
| **Referencing groups and matches in `.replace()`** |
| \$1 | insert captured group (\$2 = 2nd captured group, \$3 = 3rd, etc.) |
| \$\` | insert the part of the string preceding the match |
| \$' | insert the part of the string following the match |
| \$& | insert entire match |

#### ( ) -- capturing group

In [None]:
// What if you want to use the | over more than one char?
// Use () to contain the multi-char alternatives
'aaBBaaaaCCaa'.match(/aa(BB|CC)aa/g);

In [None]:
// Parentheses also have a "capturing" function (let's use .match() without the g flag)
'aaBBaaaaCCaa'.match(/aa(BB|CC)aa/); // [ 'aaBBaa', 'BB', index: 0, input: 'aaBBaaaaCCaa' ]
                                     // Notice the 'BB' captured by the ()

In [None]:
'aaBBaaaaCCaa'.match(/aa(BB|CC)aa/g); // .match() with g flag won't return the group captures

In [None]:
// But .exec() lets us get the group captures of all matches, using a loop
REGEX = /aa(BB|CC)aa/g;
str = 'aaBBaaaaCCaaaaBBaa';
var myArray;
while ((myArray = REGEX.exec(str)) !== null) {
    console.log(myArray);
}

#### \1, \2, \3... -- referencing group captures

In [None]:
// Grouping also allows reference back to the captured group
'dada'.match(/(da)\1/); // [ 'dada', 'da', index: 0, input: 'dada' ]

In [None]:
'data'.match(/(da)\1/); // null

In [None]:
// NOTE: the reference is to what is captured, NOT the pattern itself!
'datata'.match(/(da|ta)\1/); // [ 'tata', 'ta', index: 2, input: 'datata' ]

In [None]:
'datata'.match(/([dt]a)\1/); // [ 'tata', 'ta', index: 2, input: 'datata' ]

In [None]:
// You can have as many groups as you want
'efghijefghij'.match(/(e)(f)(g)(h)(i)(j)\1\2\3\4\5\6/);

In [None]:
// The numbering depends on the order in which the ( appears
//      For example, in the case of nested groups
'tatatat'.match(/((t|d)a)\1\2/); // \1 = 'ta', \2 = 't'
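
A typical use of a back reference is finding an accidentally doubled word. A minimal sketch; it borrows `\b` (word boundary) and `+` (1 or more times) from later sections, and the variable names are illustrative:

```javascript
var doubledWord = /\b(\w+) \1\b/; // a word, a space, then the SAME word again

var hit = 'This is is a test'.match(doubledWord);
console.log(hit[0]);    // 'is is' -- entire match
console.log(hit[1]);    // 'is'    -- what group 1 captured
console.log(hit.index); // 5

var noHit = doubledWord.test('This is a test');
console.log(noHit);     // false
```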

#### (?: ) -- non-capturing group

In [None]:
// If you don't want the parentheses to make a capture, use (?: )
'aaCCaa'.match(/aa(?:BB|CC)aa/);
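
The difference shows up in the returned array. A side-by-side sketch (variable names are illustrative):

```javascript
var withCapture = 'aaCCaa'.match(/aa(BB|CC)aa/);
var withoutCapture = 'aaCCaa'.match(/aa(?:BB|CC)aa/);

console.log(withCapture.length, withCapture[1]);       // 2 'CC'
console.log(withoutCapture.length, withoutCapture[1]); // 1 undefined
```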

#### Referencing groups and matches in `.replace()`

*>>> Referencing by group order*

In [None]:
// Use $1, $2, etc. to reference a captured group in the replacement string of .replace()
'HyperText Markup Language'.replace(/([a-z])([A-Z])/, '$1 $2');
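
Group references can also reorder the captured pieces, and with the g flag they apply to every match. A short sketch with made-up input strings:

```javascript
// Reorder year-month-day into day/month/year
var reordered = '2017-07-21'.replace(/(\d+)-(\d+)-(\d+)/, '$3/$2/$1');
console.log(reordered); // '21/07/2017'

// With flag g, $1 and $2 are filled in again for each match
var spaced = 'HyperText MarkupLanguage'.replace(/([a-z])([A-Z])/g, '$1 $2');
console.log(spaced); // 'Hyper Text Markup Language'
```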

*>>> String before match pattern*

In [None]:
// Use $` (dollar sign + backtick) to get string BEFORE match pattern
'hyperTEXT MARKUP'.replace(/[A-Z]/, '$`'); // matched pattern = 'T' (index 5)

In [None]:
'hyperTEXT MARKUP'.replace(/[A-Z]/, '<$`>'); // better visualization

*>>> String after match pattern*

In [None]:
// Use $' (dollar sign + single quote) to get string AFTER matched pattern
'hyperTEXT MARKUP'.replace(/[A-Z]/, "$'");

In [None]:
'hyperTEXT MARKUP'.replace(/[A-Z]/, "<$'>"); // better visualization

// Escape the single quote if inside single quotes:
'hyperTEXT MARKUP'.replace(/[A-Z]/, '<$\'>');

*>>> Entire match*

In [None]:
// To get the entire matched pattern, use $&
'hyperTEXT MARKUP'.replace(/[A-Z]/, '$&$&$&$&$&');

In [None]:
'hyperTEXT MARKUP'.replace(/[A-Z]/, '<$&$&$&$&$&>'); // better visualization
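
All three references can be combined in one replacement string. A compact sketch on a 3-char string, where the match is 'b':

```javascript
// $` = 'a' (before the match), $& = 'b' (the match), $' = 'c' (after the match)
var allThree = 'abc'.replace(/b/, "[$`|$&|$']");
console.log(allThree); // 'a[a|b|c]c'

// With flag g, $& refers to each match in turn
var bracketed = 'a1b22'.replace(/\d+/g, '[$&]');
console.log(bracketed); // 'a[1]b[22]'
```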

### 7. Quantifiers <a name='7'></a>
Quantifiers are placed immediately AFTER the char, range, or group they quantify

| REGEX | DESCRIPTION |
| :--- | :--- |
| ? | appears 0 or 1 time |
| \* | appears 0 or any number of times |
| + | appears 1 or more times |
| \*? | non-greedy \* |
| +? | non-greedy \+ |
||
| {2} | appears 2 times |
| {3,5} | appears 3-5 times |
| {4,} | appears at least 4 times |

#### ? --> 0 or 1 time

In [None]:
// Optional parentheses for country code in phone numbers
'852 2345 6789'.search(/^\(?852\)?/);

In [None]:
'(852) 2345 6789'.search(/^\(?852\)?/);
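
Another common use of ? is accepting spelling variants. A small sketch with a made-up input string:

```javascript
var spellings = 'color colour colouur';
var optionalU = spellings.match(/colou?r/g); // the u may appear 0 or 1 time

console.log(optionalU); // [ 'color', 'colour' ]
```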

#### * --> 0, 1, or many times

In [None]:
// Get all words with "watch" followed by any number of letters
'watch, watchy, watched, watching'.match(/watch[a-z]*/g);

#### + --> 1 or more times

In [None]:
// + --> appears 1 or more times
// Get all continuous "word" strings
"My name is R2D2. I'm a robot.".match(/[\w]+/g);

#### Non-greedy ? (placed after + or \*)
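- \* and + are by default "greedy"
- That means they take as many chars as possible: the match runs up to the very LAST place where the rest of the pattern still matches
- With ? placed after them, they take as few chars as possible: the match stops at the FIRST place where the rest of the pattern matches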

In [None]:
"My name is R2D2. I'm a robot.".match(/^.+\./g); // bad attempt at getting only one sentence

In [None]:
"My name is R2D2. I'm a robot.".match(/^.+?\./g); // non-greedy ? to the rescue

In [None]:
// Another example

var htmlStr = '<div class="container">Container</div>  <div class="jumbotron">Header</div>';
htmlStr.match(/<div class="container">.+<\/div>/g); // bad attempt at getting the container div

In [None]:
htmlStr.match(/<div class="container">.+?<\/div>/g); // non-greedy ? to the rescue

**NOTE** 
- The above doesn't take care of nested tags, and nested things are not straightforward to do in regex.
- **Use an HTML / XML / etc. parser instead!**
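
For simple delimiters (not nested markup), a negated range is a common alternative to the non-greedy ?: instead of "as few chars as possible", it says "anything that is not the closing delimiter". A sketch with a made-up sentence:

```javascript
var quotedText = 'He said "yes" and then "no".';

var greedy = quotedText.match(/".+"/g);       // runs to the LAST quote
var nonGreedy = quotedText.match(/".+?"/g);   // stops at the first closing quote
var negated = quotedText.match(/"[^"]+"/g);   // cannot run past a quote at all

console.log(greedy);    // [ '"yes" and then "no"' ]
console.log(nonGreedy); // [ '"yes"', '"no"' ]
console.log(negated);   // [ '"yes"', '"no"' ]
```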

#### { } --> exact (and greedy) quantification

In [None]:
'The flower bloomed in the voracious wind.'.match(/\w{4}/g); // chunks of exactly 4 word chars (also cut out of longer words)

In [None]:
'The flower bloomed in the voracious wind.'.match(/\w{4,6}/g); // 4-6 letters

In [None]:
'The flower bloomed in the voracious wind.'.match(/\w{4,}/g); // 4 or more letters
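
Combined with ^ and $, exact quantifiers are handy for checking a fixed format. A sketch assuming a hypothetical phone number format of 4 digits, a space, and 4 digits:

```javascript
var phoneFormat = /^\d{4} \d{4}$/; // hypothetical format, for illustration only

var wellFormed = phoneFormat.test('2345 6789');
var wrongGrouping = phoneFormat.test('234 56789');
var tooLong = phoneFormat.test('2345 67890');

console.log(wellFormed);    // true
console.log(wrongGrouping); // false
console.log(tooLong);       // false
```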

### 8. Anchors <a name='8'></a>

**NOTE**: These DON'T represent actual characters

| REGEX | DESCRIPTION | |
| :-- | :-- | :-- |
| | WITHOUT FLAG 'm' | WITH FLAG 'm' |
| ^ | start of the string | start of a string or line (bounded by \n or \r) |
| \$ | end of the string | end of a string or line (bounded by \n or \r) |

| REGEX | DESCRIPTION |
| :-- | :-- |
|| NOT AFFECTED BY FLAG 'm' |
| \b | word boundary (edges of continuous \w strings, i.e. [A-Za-z0-9_]) |

#### ^ --> start of the string(/line)

In [None]:
"There's a rainbow here.\nThere's a rainbow there.".match(/^There's/g);

In [None]:
"There's a rainbow here.\nThere's a rainbow there.".match(/^There's/gm); // m flag

#### $ --> end of the string(/line)

In [None]:
"There's a rainbow here.\nThere's a rainbow there.".match(/t?here\.$/g);

In [None]:
"There's a rainbow here.\nThere's a rainbow there.".match(/t?here\.$/gm); // m 

#### \b --> word boundary

In [None]:
"My name is R2D2. I'm a robot.".match(/\b.+?\b/g);

In [None]:
"My name is R2D2. I'm a robot.".match(/\b\w+?\b/g); // just the "words"

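A frequent use of `\b` is searching for a whole word rather than a substring. A short sketch with a made-up sentence:

```javascript
var sentence = 'concatenate the cat category';

var substringHits = sentence.match(/cat/g).length;     // also inside longer words
var wholeWordHits = sentence.match(/\bcat\b/g).length; // only the standalone word
var wholeWordIndex = sentence.search(/\bcat\b/);

console.log(substringHits);  // 3
console.log(wholeWordHits);  // 1
console.log(wholeWordIndex); // 16
```
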
### 9. Assertions <a name='9'></a>

These allow you to look at the context that follows your regex pattern without including that context in your match.

| REGEX | DESCRIPTION |
| :-- | :-- |
| (?= ) | positive lookahead |
| (?! ) | negative lookahead |

#### (?= ) --> positive lookahead

In [None]:
'aaBBaaCCaa'.match(/aa(BB|CC)(?=aa)/g);

In [None]:
// Although assertions make use of parentheses, they DON'T trigger grouping
'aaBBaa'.match(/aaBB(?=aa)/);

#### (?! ) --> negative lookahead

In [None]:
'aaBBaaaaBB--'.match(/aaBB(?!aa)/);
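
A more practical sketch: picking out numbers by the unit that follows them, using a made-up string. It also shows a pitfall of the negative lookahead: a quantifier can give back chars until the assertion passes, so the "following context" may need to exclude more digits as well.

```javascript
var sizes = '10px 20em 30px';

var followedByPx = sizes.match(/\d+(?=px)/g);
var naiveNotPx = sizes.match(/\d+(?!px)/g);      // '10' fails, but '1' is followed by '0', not 'px'
var notPx = sizes.match(/\d+(?!\d|px)/g);        // also refuse to stop in the middle of a number

console.log(followedByPx); // [ '10', '30' ]
console.log(naiveNotPx);   // [ '1', '20', '3' ]
console.log(notPx);        // [ '20' ]
```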

**Why use assertions?**
1. You don't want the match to contain the string inside the assertion
2. You want to get overlapping matches (normally you can't)

In [None]:
// Normally, in a global search, once there is a match, that match is not searched again
'aaBBaaBBaa'.match(/aaBBaa/g);

/*  What happens step by step:

    1. 'aaBBaaBBaa' --> MATCH 'aaBBaa', REMAINING SEARCH POOL '______BBaa'
    2. '______BBaa' --> NO MATCH
    3. END SEARCH, RETURN ['aaBBaa']
*/

In [None]:
// With a lookahead assertion, the "asserted string" stays in the search pool
'aaBBaaBBaa'.match(/aaBB(?=aa)/g);

/*  What happens step by step:
    1. 'aaBBaaBBaa' --> MATCH 'aaBB', REMAINING SEARCH POOL '____aaBBaa'
    2. '____aaBBaa' --> MATCH 'aaBB', REMAINING SEARCH POOL '________aa'
    3. '________aa' --> NO MATCH
    4. END SEARCH, RETURN ['aaBB', 'aaBB']
*/

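**Unfortunately, JS did not have lookbehind up to and including ES8 (ES2017).**
- Lookbehind assertions `(?<= )` and `(?<! )` were added to the language in ES2018, so they only work in engines that support that version.
- For older engines, some people have figured out workarounds, but they can get very complicated.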
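
One of the simpler workarounds, sketched here under the assumption that only the text AFTER a fixed prefix is wanted: include the prefix in the match, but keep only a capturing group. The input string and variable names are made up for illustration.

```javascript
var prices = 'apples $10, pears $25, 3 bags';
var priceRegex = /\$(\d+)/g; // match the '$' prefix, but capture only the digits
var amounts = [];
var priceMatch;

while ((priceMatch = priceRegex.exec(prices)) !== null) {
    amounts.push(priceMatch[1]); // group 1 = the digits without the '$'
}

console.log(amounts); // [ '10', '25' ] -- the 3 is skipped, no '$' before it
```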

### 10. References and tools <a name='10'></a>

- MDN documentation: https://developer.mozilla.org/en/docs/Web/JavaScript/Guide/Regular_Expressions
- Javascript regex cheatsheet: https://www.debuggex.com/cheatsheet/regex/javascript
- Regex tool 1: http://regexr.com/
- Regex tool 2: https://regex101.com/
- Atom regex helper: `regex-railroad-diagram`

**NOTE:** Different programming languages have different implementations of regular expressions / pattern matching functions, although the basics are usually the same.

- PostgreSQL: https://www.postgresql.org/docs/9.6/static/functions-matching.html
- Python: https://docs.python.org/3/library/re.html

**NOTE:** Many text editors also have some implementation of regular expressions
- In `Atom`, COMMAND/CTRL + F, then click on the button [ .\* ] to enable/disable regex