# Introduction to regular expressions with examples

A regular expression is a sequence of characters that specifies a search pattern.
The examples in this tutorial are written in Scala, but most of the tokens and patterns work in other languages as well.


-----------------------------------------------------------

## Summary 

**Char**

Digit / Alphanumeric / Whitespace

1. `\d`: Any digit from 0 to 9
1. `\D`: Any non-digit 
1. `\w`: Any alphanumeric character or underscore
1. `\W`: Any character that is not alphanumeric or underscore
1. `\s`: Any whitespace
1. `\S`: Any non-whitespace

Wildcard

1. `"."`: the wildcard char


Match character

1. `[abc]`: Match specific characters
1. `[^abc]`: Exclude specific characters


Range

1. `[a-z]`: Match a char within the range
1. `[^a-z]`: Exclude a char within the range
1. `[a-z0-9]`: Match a char within multiple ranges

------------

**String**

Match string 

1. `"abc"`: Match a substring that is the same as the pattern


Repetitions 

1. `{m}`:	m repetitions
1. `{m,n}`: m to n repetitions
1. `{m,}`: m or more repetitions
1.  `*` : Kleene Star - 0 or more repetitions
1.  `+` : Kleene Plus - 1 or more repetitions

Starting and ending

1. `^` :   Start of the line
1. `$` : End of the line

Capture Group

1. `(…)`:	Capture Group
1. `case ... match ...` : Capture Groups in Scala

Optional

1.  `?`: match either zero or one of the preceding character or group
1. `(foo|bar)` : match `foo` or `bar`

Capture all

1. `.*`: Capture all



In [1]:
//  Scala dependency 
import scala.util.matching.Regex

------------------------------------------------

## Char

### Digit / Alphanumeric / Whitespace

#### `\d`: Any digit from 0 to 9 

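The preceding backslash `\` distinguishes it from the plain `d` character and marks it as a metacharacter.

> **Note**: In a regular Scala string literal the backslash itself must be escaped, so `\d` is written `"\\d".r`. The `raw` interpolator (`raw"\d".r`) avoids the doubling.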

For example, `\\d`: 

- match `1` in `1234`  
- match `2` in `2 foo`
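
The escaping note above can be sketched as follows. This is a minimal illustration; the value names (`escaped`, `rawInterp`, `tripleQuoted`, `sample`) are made up for the example. All three spellings should produce the same pattern `\d`.

```scala
import scala.util.matching.Regex

// Three ways to spell the pattern \d in Scala source code.
val escaped: Regex      = "\\d".r     // regular string literal: the backslash is doubled
val rawInterp: Regex    = raw"\d".r   // raw interpolator: the backslash is kept as written
val tripleQuoted: Regex = """\d""".r  // triple-quoted literal: no escape processing

val sample = "2 foo"
val firstDigits = List(escaped, rawInterp, tripleQuoted).map(_.findFirstIn(sample))
println(firstDigits) // List(Some(2), Some(2), Some(2))
```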


#### `\D`: Any non-digit character

For example, `"\\D"` :
 - match `" "` (space) in `1234 a` 
 - match `a` in `a 2 foo`

#### `\w`:	Any alphanumeric character

Equivalent to the character class `[A-Za-z0-9_]`, so the underscore is included.

For example, `\\w`: match 
- `A` in `Ana`	
- `0` in `*012`

and skip `"***"`

#### `\W`:	Any non-alphanumeric character

For example, `\\W`: match  `*` in `"***"`

and skip 
- `Ana`	
- `0123` and `Bob` in `0123 Bob` (the space between them does match)
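
As a small sketch of how `\w` and `\W` split a string between them, the snippet below collects every single-character match. The sample string is an assumption chosen to include an underscore.

```scala
val text = "a_1 *-"

// Collect every character matched by each class and join them into one string.
val wordChars    = "\\w".r.findAllIn(text).mkString // letters, digits and underscore
val nonWordChars = "\\W".r.findAllIn(text).mkString // everything else, including the space

println(s"[$wordChars] [$nonWordChars]") // [a_1] [ *-]
```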

#### `\s`: Any whitespace

Whitespace characters include:
- space `" "`
- tab `\t`
- new line `\n` 
- carriage return `\r`
- form feed `\f` and vertical tab `\x0B`
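
A minimal sketch of `\s` and its complement `\S`, assuming a sample string that contains one space, one tab, one newline and one carriage return:

```scala
val text = "a b\tc\nd\re"

// Every whitespace character, in order of appearance.
val whitespace = "\\s".r.findAllIn(text).toList
// Every non-whitespace character, joined into one string.
val nonWhitespace = "\\S".r.findAllIn(text).mkString

println(whitespace.length) // 4
println(nonWhitespace)     // abcde
```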

In [2]:
val pattern = "\\d".r
val text = "1234"
pattern findFirstIn text

pattern: Regex = \d
text: String = "1234"
res1_2: Option[String] = Some("1")

In [3]:
val pattern = "\\D".r
val text = "1234 a"
pattern findFirstIn text

pattern: Regex = \D
text: String = "1234 a"
res2_2: Option[String] = Some(" ")

In [4]:
val pattern = "\\w".r
val text = "*012"
pattern findFirstIn text

pattern: Regex = \w
text: String = "*012"
res3_2: Option[String] = Some("0")

In [5]:
val pattern = "\\W".r
val text = "***"
pattern findFirstIn text

pattern: Regex = \W
text: String = "***"
res4_2: Option[String] = Some("*")

In [6]:
val pattern = "\\d.\\s+abc".r
val text = "3.           abc"
pattern findFirstIn text

pattern: Regex = \d.\s+abc
text: String = "3.           abc"
res5_2: Option[String] = Some("3.           abc")

In [7]:
val pattern = "\\d.\\s+abc".r
val text = "4.abc"
pattern findFirstIn text

pattern: Regex = \d.\s+abc
text: String = "4.abc"
res6_2: Option[String] = None

---------------------

### Wildcard

####  ".": The wildcard char

A wildcard is a card that can represent any card in the deck in poker games. Similarly, `.` (dot) can match any single character (letter, digit, whitespace, punctuation). By default it does not match line terminators such as `\n`.


> **Note**: 
```
.  is the wildcard
\\. is the dot symbol or period
```

For example, `...\\.`: match 
- `"cat."`
- `"896."`
- `"?=+."`	

and skip	
- `abc1`
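
The difference between the wildcard and an escaped dot can be sketched as below. The patterns `a.c` and `a\.c` are illustrative and not part of the examples above; the last two lines show the default treatment of line terminators and the `(?s)` flag that lifts it.

```scala
val wildcard   = "a.c".r   // . matches any single character
val literalDot = "a\\.c".r // \. matches only a period

println(wildcard.findFirstIn("abc"))   // Some(abc)
println(literalDot.findFirstIn("abc")) // None
println(literalDot.findFirstIn("a.c")) // Some(a.c)

// By default the dot does not match a line terminator; the (?s) flag lifts that limit.
val dotDefault = ".".r.findFirstIn("\n")
val dotAll     = "(?s).".r.findFirstIn("\n")
```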

In [8]:
val pattern = "...\\.".r
val text = "cat."
pattern findFirstIn text

pattern: Regex = ...\.
text: String = "cat."
res7_2: Option[String] = Some("cat.")

In [9]:
val pattern = "...\\.".r
val text = "abc1"
pattern findFirstIn text

pattern: Regex = ...\.
text: String = "abc1"
res8_2: Option[String] = None

-----------

### Match character

#### `[abc]`: Match specific characters

Define the specific characters you want to match inside square brackets. The pattern `[abc]` will only match a single `a`, `b`, or `c` letter and nothing else.


For example, `[cmf]an`: match 
- `"can"`
- `"man"`
- `"fan"`	

and skip	
- `dan`
- `ran`
- `pan`



#### `[^abc]`: Exclude specific characters

We exclude specific characters by using the square brackets and the `^` (hat). 
For example, the pattern `[^abc]` will match any single character except for the letters `a`, `b`, or `c`.

> **Note** 
The hat inside square brackets (`[^...]`, exclusion) is different from the hat used as the start-of-line anchor (`^start`), which can be confusing when reading regular expressions.

For example, `[^cmf]an`: match 
- `dan`
- `ran`
- `pan`

and skip	
- `"can"`
- `"man"`
- `"fan"`	
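
Both character sets can be tried on one string with `findAllIn`, which returns every match rather than only the first. A minimal sketch using the six words from the lists above:

```scala
val words = "can man fan dan ran pan"

// [cmf]an keeps the words that start with c, m or f.
val included = "[cmf]an".r.findAllIn(words).toList
// [^cmf]an keeps the words that start with any other character.
val excluded = "[^cmf]an".r.findAllIn(words).toList

println(included) // List(can, man, fan)
println(excluded) // List(dan, ran, pan)
```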


In [10]:
val pattern = "[cmf]an".r
val text = "can"
pattern findFirstIn text

pattern: Regex = [cmf]an
text: String = "can"
res9_2: Option[String] = Some("can")

In [11]:
val pattern = "[cmf]an".r
val text = "dan"
pattern findFirstIn text

pattern: Regex = [cmf]an
text: String = "dan"
res10_2: Option[String] = None

In [12]:
val pattern = "[^cmf]an".r
val text = "dan"
pattern findFirstIn text

pattern: Regex = [^cmf]an
text: String = "dan"
res11_2: Option[String] = Some("dan")

In [13]:
val pattern = "[^cmf]an".r
val text = "can"
pattern findFirstIn text

pattern: Regex = [^cmf]an
text: String = "can"
res12_2: Option[String] = None

-------------------------------------------

### Range

#### `[a-z]`: Match a char within the range

Match a character in a list of sequential characters by using the dash to indicate a character range. 

For example, `[0-6]`: match any single digit character from `0` to `6` 



#### `[^a-z]`: Exclude a char within the range

For example, `[^n-p]`: match any single character except for letters `n` to `p`


#### `[a-z0-9]`: Match a char within multiple ranges
Multiple character ranges can also be used in the same set of brackets 

For example, `[A-Z0-9]`: match any single character from `A` to `Z` or `0` to `9`
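
The three range examples above can be checked with a short sketch. The input strings are assumptions picked so that each one contains characters both inside and outside the range.

```scala
// Each line collects every single-character match and joins the results.
val zeroToSix    = "[0-6]".r.findAllIn("0123456789").mkString      // digits 0 through 6 only
val notNToP      = "[^n-p]".r.findAllIn("mnopq").mkString          // everything except n, o, p
val upperOrDigit = "[A-Z0-9]".r.findAllIn("ab-X9-cd-42Z").mkString // upper-case letters and digits

println(zeroToSix)    // 0123456
println(notNToP)      // mq
println(upperOrDigit) // X942Z
```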



In [14]:
val pattern = "[A-C][n-p][a-c]".r
val text = "Ana"
pattern findFirstIn text

pattern: Regex = [A-C][n-p][a-c]
text: String = "Ana"
res13_2: Option[String] = Some("Ana")

In [15]:
val pattern = "[A-C][n-p][a-c]".r
val text = "aax"
pattern findFirstIn text

pattern: Regex = [A-C][n-p][a-c]
text: String = "aax"
res14_2: Option[String] = None

In [16]:
val pattern = "[A-C0-9][A-C0-9]".r
val text = "A0x"
pattern findFirstIn text

pattern: Regex = [A-C0-9][A-C0-9]
text: String = "A0x"
res15_2: Option[String] = Some("A0")


----------------------

## String

### Match string

#### "abc": Match a substring that is the same as the pattern

For example, `"foo 1"`: match `"foo 1"` in `"foo 1 fooo"`
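
Literal matching only works as-is when the string contains no metacharacters. When it does, `Regex.quote` from the Scala standard library escapes the whole string. The sample text `1+1=2` below is an assumption chosen because `+` is a metacharacter.

```scala
import scala.util.matching.Regex

val text = "is 1+1=2?"

// Unquoted, the + is read as "one or more of the preceding 1", so the literal text is not found.
val unquoted = "1+1=2".r.findFirstIn(text)
// Regex.quote wraps the string so that every character is matched literally.
val quoted = Regex.quote("1+1=2").r.findFirstIn(text)

println(unquoted) // None
println(quoted)   // Some(1+1=2)
```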



In [17]:
val pattern = "foo 1".r
val text = "foo 1 fooo"
pattern findFirstIn text

pattern: Regex = foo 1
text: String = "foo 1 fooo"
res16_2: Option[String] = Some("foo 1")

----------------------------

### Repetitions

#### `{m}`:	m repetitions

For example, `B{3}`:  match the `B` character exactly three times


#### `{m,n}`: m to n repetitions

For example, `B{1,3}`: match the `B` character one to three times



#### `{m,}`: m or more repetitions

For example, `B{3,}`: match the `B` character at least three times



> **Note** 
`{,m}` is illegal in Scala (Java) regular expressions. Use `{0,m}` for "at most m repetitions".

Error msg:

```
java.util.regex.PatternSyntaxException: 
Illegal repetition near index 2 pur{,3}

```
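
A minimal sketch of the workaround, assuming the same `pur` example as the error message above: the pattern is compiled when `.r` is called, so wrapping it in `Try` captures the exception, and `{0,3}` expresses "at most three" legally.

```scala
import scala.util.Try

// Building the regex compiles the pattern, so the illegal form fails right here.
val illegal = Try("pur{,3}".r)
// {0,3} means "zero to three repetitions" and is accepted.
val legal = "pur{0,3}".r

println(illegal.isFailure)            // true
println(legal.findFirstIn("pu"))      // Some(pu)
println(legal.findFirstIn("purrrrr")) // Some(purrr)
```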

####  `*` : Kleene Star 0 or more repetitions

For example, `\d*`: match zero or more digits (including none at all)


####  `+` :	 Kleene Plus 1 or more repetitions

For example, `\d+`: match one or more digits





#### Exercise

Match: `aaaabcc`, `aabbbbc`, `aacc`

Skip: `a`

**Solutions**

- `a\w+` 
- `a{2}[abc]*`
- `aa+b*c+` 
- `a{2,4}b{0,4}c{1,2}`
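
The solutions can be checked mechanically. The sketch below is one possible harness, not part of the original exercise; the helper `fullMatch` is a hypothetical name, and it requires the whole string to match rather than a substring.

```scala
// Strings from the exercise above.
val shouldMatch = List("aaaabcc", "aabbbbc", "aacc")
val shouldSkip  = List("a")

val solutions = List("a\\w+", "a{2}[abc]*", "aa+b*c+", "a{2,4}b{0,4}c{1,2}")

// True when the whole string, not just a substring, matches the pattern.
def fullMatch(pattern: String, s: String): Boolean =
  pattern.r.pattern.matcher(s).matches()

// For each solution: does it accept every "match" string and reject every "skip" string?
val report = solutions.map { p =>
  p -> (shouldMatch.forall(fullMatch(p, _)) && !shouldSkip.exists(fullMatch(p, _)))
}
report.foreach { case (p, ok) => println(s"$p -> $ok") }
```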


In [18]:
val pattern = "pur{3}".r
val text = "purrrrr"
pattern findFirstIn text

pattern: Regex = pur{3}
text: String = "purrrrr"
res17_2: Option[String] = Some("purrr")

In [19]:
val pattern = "pur{1,3}".r
val text = "purrr"
pattern findFirstIn text

pattern: Regex = pur{1,3}
text: String = "purrr"
res18_2: Option[String] = Some("purrr")

In [20]:
val pattern = "pur{1,3}".r
val text = "pu"
pattern findFirstIn text

pattern: Regex = pur{1,3}
text: String = "pu"
res19_2: Option[String] = None

In [21]:
val pattern = "pur{3,}".r
val text = "purrrrrrr"
pattern findFirstIn text

pattern: Regex = pur{3,}
text: String = "purrrrrrr"
res20_2: Option[String] = Some("purrrrrrr")

In [22]:
val pattern = "\\w+".r
val text = ""
pattern findFirstIn text

pattern: Regex = \w+
text: String = ""
res21_2: Option[String] = None

In [23]:
val pattern = "\\w*".r
val text = ""
pattern findFirstIn text

pattern: Regex = \w*
text: String = ""
res22_2: Option[String] = Some("")

In [24]:
val pattern = "\\w*".r
val text = "anyAlphanumeric"
pattern findFirstIn text

pattern: Regex = \w*
text: String = "anyAlphanumeric"
res23_2: Option[String] = Some("anyAlphanumeric")

-----------------------

### Starting and ending

#### `^` :   Start of the line

> **Note** 
`^success` matches only a line that begins with the word `"success"`, but not the line `"Error: unsuccessful operation"`.

> **Note** 
The hat used as an anchor is different from the hat used inside a set of brackets `[^...]` for excluding characters, which can be confusing when reading regular expressions.



#### `$` : End of the line
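
Both anchors can be sketched together. The sample strings are assumptions based on the `^success` note; the second half shows that, without a flag, `^` and `$` anchor to the whole input, and that the `(?m)` flag makes them apply to each line.

```scala
val startsWithSuccess = "^success".r
val endsWithDone      = "done$".r

println(startsWithSuccess.findFirstIn("success: done"))                 // Some(success)
println(startsWithSuccess.findFirstIn("Error: unsuccessful operation")) // None
println(endsWithDone.findFirstIn("success: done"))                      // Some(done)

// Two lines in one string: "success" starts the second line, not the input.
val log = "Error: unsuccessful operation\nsuccess: done"
val wholeInput = "^success".r.findFirstIn(log)     // ^ anchors to the start of the input
val perLine    = "(?m)^success".r.findFirstIn(log) // (?m) lets ^ match after a line break
```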



In [25]:
val pattern = "end$".r
val text = "The end"
pattern findFirstIn text

pattern: Regex = end$
text: String = "The end"
res24_2: Option[String] = Some("end")

In [26]:
val pattern = "^start".r
val text = "starting"
pattern findFirstIn text

pattern: Regex = ^start
text: String = "starting"
res25_2: Option[String] = Some("start")

In [27]:
val pattern = "^start".r
val text = "Now start"
pattern findFirstIn text

pattern: Regex = ^start
text: String = "Now start"
res26_2: Option[String] = None

------------------

### Capture Group

Regular expressions allow us not just to match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses `(` and `)` metacharacters. Any subpattern inside a pair of parentheses will be captured as a group.


#### `(…)` Capture Group

Imagine that you had a command line tool to list all the image files you have in the cloud. You could then use a pattern such as `^(IMG\d+\.png)$` to capture and extract the full filename, but if you only wanted to capture the filename without the extension, you could use the pattern `^(IMG\d+)\.png$`, which only captures the part before the period.
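
The two patterns from the paragraph above can be compared directly. This is a minimal sketch; the file name `IMG0042.png` is a made-up example, and `findFirstMatchIn` is used so that the captured group can be read with `group(1)`.

```scala
// The two patterns from the paragraph above; triple quotes avoid doubling the backslashes.
val withExtension    = """^(IMG\d+\.png)$""".r
val withoutExtension = """^(IMG\d+)\.png$""".r

val fileName = "IMG0042.png" // hypothetical file name, for illustration only

// group(1) is the text captured by the first pair of parentheses.
val full = withExtension.findFirstMatchIn(fileName).map(_.group(1))
val stem = withoutExtension.findFirstMatchIn(fileName).map(_.group(1))

println(full) // Some(IMG0042.png)
println(stem) // Some(IMG0042)
```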

#### `case ... match` Capture Group in Scala


```scala
val date = raw"(\d{4})-(\d{2})-(\d{2})".r
```

To extract the capturing groups when a Regex is matched, use it as an extractor in a pattern match:

```scala
"2004-01-20" match {
  case date(year, month, day) => s"$year $month $day"
}
```

To check only whether the Regex matches, ignoring any groups, use a sequence wildcard:

```scala
"2004-01-20" match {
  case date(_*) => "It's a date!"
}
```

Extracting only the year from a date could also be expressed with a sequence wildcard:

```scala
"2004-01-20" match {
  case date(year, _*) => s"$year"
}
```

In a pattern match, a Regex normally has to match the entire input. An unanchored Regex, however, finds the pattern anywhere in the input.

```scala
val embeddedDate = date.unanchored

"Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago)" match {
  case embeddedDate("2004", "01", "20") => "A Scala is born."
}

```

#### Exercise

Capture 
- `file_record_transcript` in	`file_record_transcript.pdf` 
- `file_07241999` in `file_07241999.pdf`	

and skip `testfile_fake.pdf.tmp`

**Solution**

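`^(file_\S+)\.pdf$`

> **Note**
The dot before `pdf` is escaped here. The unescaped form `^(file_\S+).pdf$` also passes the cases above, but its `.` is a wildcard, so it would also accept a name such as `file_abXpdf`.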

#### `(a(bc))` Capture Sub-group
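
Groups can be nested, and they are numbered by the position of their opening parenthesis. A minimal sketch using the pattern from the heading; the names `outer` and `inner` are illustrative.

```scala
val nested = "(a(bc))".r

// The outer group opens first, so it is bound first; the inner group follows.
val groups = "abc" match {
  case nested(outer, inner) => Some((outer, inner))
  case _                    => None
}

println(groups) // Some((abc,bc))
```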

#### `(.*)` Capture all


In [28]:
val pattern = "^(file_\\S+).pdf$".r
val text = "file_record_transcript.pdf"

pattern findFirstIn text 

text match {
  case pattern(file) => println(s"$file")
}

file_record_transcript

pattern: Regex = ^(file_\S+).pdf$
text: String = "file_record_transcript.pdf"
res27_2: Option[String] = Some("file_record_transcript.pdf")

In [29]:
val pattern = "^(file_\\S+).pdf$".r
val text = "testfile_fake.pdf.tmp"

pattern findFirstIn text

pattern: Regex = ^(file_\S+).pdf$
text: String = "testfile_fake.pdf.tmp"
res28_2: Option[String] = None

In [30]:
val date = raw"(\d{4})-(\d{2})-(\d{2})".r
val embeddedDate = date.unanchored

"Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago)" match {
  case embeddedDate("2004", "01", "20") => "A Scala is born."
}

date: Regex = (\d{4})-(\d{2})-(\d{2})
embeddedDate: scala.util.matching.UnanchoredRegex = (\d{4})-(\d{2})-(\d{2})
res29_2: String = "A Scala is born."

In comparison, the anchored `date` regex must match the entire input, so the same pattern match fails when the date is embedded in a longer string.

```scala
val date = raw"(\d{4})-(\d{2})-(\d{2})".r

"Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago)" match {
  case date("2004", "01", "20") => "A Scala is born."
}

```

Error msg: 

```
scala.MatchError: 
Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago) 
(of class java.lang.String)
```
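
One way to avoid the `MatchError` is to add a fallback case. The sketch below assumes a helper named `describe`, which is hypothetical and only wraps the pattern match shown above.

```scala
val date = raw"(\d{4})-(\d{2})-(\d{2})".r

// The wildcard case handles any input that the anchored regex does not fully match.
def describe(input: String): String = input match {
  case date(year, month, day) => s"year=$year month=$month day=$day"
  case _                      => "no date"
}

println(describe("2004-01-20"))                    // year=2004 month=01 day=20
println(describe("Date: 2004-01-20 17:25:18 GMT")) // no date
```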

-------------------
### Optional

#### `?`:  match either zero or one of the preceding character or group

For example, `ab?c`: match either the strings `"abc"` or `"ac"` because the b is considered optional.

> **Note**
The question mark is a special character, so you have to escape it with a backslash (`\?`) to match a plain question mark character in a string.

#### `(foo|bar)`: match `foo` or `bar`

For example, `(abc|def)`: match `abc` or `def`
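
The optional marker and alternation can be sketched as follows. The input strings are assumptions chosen to contain both matching and non-matching pieces.

```scala
// ab?c accepts "ac" and "abc" but not "abbc", because ? allows at most one b.
val optionalB = "ab?c".r.findAllIn("ac abc abbc").toList
// (abc|def) accepts either alternative wherever it occurs.
val either = "(abc|def)".r.findAllIn("abcdefxyz").toList
// An escaped question mark matches the literal character.
val literalQuestion = "\\?".r.findFirstIn("why?")

println(optionalB)       // List(ac, abc)
println(either)          // List(abc, def)
println(literalQuestion) // Some(?)
```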

#### Exercise

Match	
- `1 file found?`
- `2 files found?`
- `24 files found?`	

Skip `No files found.`

**Solution**
`\d+ files? found\?`
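
The solution can be run against the exercise strings. This is a minimal sketch; the value names are illustrative.

```scala
val filesFound = """\d+ files? found\?""".r

val shouldMatch = List("1 file found?", "2 files found?", "24 files found?")
val shouldSkip  = List("No files found.")

// true for each string in which the pattern is found
val matched = shouldMatch.map(s => filesFound.findFirstIn(s).isDefined)
// true for each string in which the pattern is not found
val skipped = shouldSkip.map(s => filesFound.findFirstIn(s).isEmpty)

println(matched) // List(true, true, true)
println(skipped) // List(true)
```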



In [31]:

val pattern = "ab?c".r
val text = "abc"
pattern findFirstIn text

pattern: Regex = ab?c
text: String = "abc"
res30_2: Option[String] = Some("abc")

----------------

### Capture all

#### `.*` would match everything
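
One caveat worth sketching: because `.` does not match line terminators by default, `.*` stops at the first line break unless the `(?s)` flag is set. The two-line sample string is an assumption.

```scala
val text = "first line\nsecond line"

// Without a flag, .* matches up to (but not including) the line break.
val firstLineOnly = ".*".r.findFirstIn(text)
// With (?s), the dot also matches line terminators, so .* covers the whole input.
val everything = "(?s).*".r.findFirstIn(text)

println(firstLineOnly)             // Some(first line)
println(everything == Some(text)) // true
```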

In [32]:
val pattern = ".*".r
val text = "****** any text 123456 ------------"

pattern findFirstIn text

pattern: Regex = .*
text: String = "****** any text 123456 ------------"
res31_2: Option[String] = Some("****** any text 123456 ------------")

-----------------------------------------------


## Reference

1. Regexone.com. 2021. RegexOne - Learn Regular Expressions - Lesson 1: An Introduction, and the ABCs. [online] Available at: [RegexOne - Learn Regular Expressions, 2021](https://regexone.com/lesson/introduction_abcs) [Accessed 5 June 2021].
1. Tutorialspoint.com. 2021. Scala - Regular Expressions - Tutorialspoint. [online] Available at: [Scala - Regular Expressions - Tutorialspoint, 2021](https://www.tutorialspoint.com/scala/scala_regular_expressions.htm) [Accessed 5 June 2021].
1. Dib, F., 2021. regex101: build, test, and debug regex. [online] regex101. Available at: [Dib, 2021](https://regex101.com/) [Accessed 5 June 2021].
1. Scala-lang.org. 2021. Scala Standard Library 2.12.5 - scala.util.matching.Regex. [online] Available at: [Scala Standard Library 2.12.5 - scala.util.matching.Regex, 2021](https://www.scala-lang.org/api/2.12.5/scala/util/matching/Regex.html) [Accessed 7 June 2021].