## Regular Expression

A regular expression, often called a pattern, is **an expression used to specify a set of strings** required for a particular purpose. 

- A simple way to specify a finite set of strings is to list its elements or members. <br>For example `{Doc1,Doc2,Doc3}`.
    

`{Doc1,Doc2,Doc3}` can be specified by the pattern `Doc(1|2|3)`. <br>We say that this pattern matches each of the three strings. [Let's check](https://regex101.com/)

> In most formalisms, if there exists at least one regular expression that matches a particular set then there exists an infinite number of other regular expressions that also match it, i.e. **the specification is not unique**.<br>
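For example, the string set `{Doc1,Doc2,Doc3}` can also be specified by the pattern `Doc[1-3]`. The looser pattern `Doc\d` matches all three strings too, but it accepts any digit (`Doc0`, `Doc4`, ...), so it specifies a larger set.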
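
As a quick check of the non-uniqueness point, here is a minimal sketch using Python's built-in `re` module (Python is an assumption; these notes do not name a language). The list `docs` and the variable names are illustrative only.

```python
import re

docs = ["Doc1", "Doc2", "Doc3"]

# Three different patterns that all accept every string in the set.
patterns = [r"Doc(1|2|3)", r"Doc[1-3]", r"Doc\d"]

for pattern in patterns:
    # fullmatch succeeds only if the whole string fits the pattern
    assert all(re.fullmatch(pattern, doc) for doc in docs)

# Doc\d is looser than the other two: it also accepts digits outside 1-3.
loose_only = [p for p in patterns if re.fullmatch(p, "Doc7")]
print(loose_only)  # ['Doc\\d']
```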



# Regex Sample Data (Self Note)
A large collection of text like this is often called a `corpus`.
```
Vaibhav
1234567899
vaibhavsaran123@gmail.com
23

Ram
1233567899
ram321@gmail.com
33

Shyam
1234567799
shyam456@gmail.com
43

ghanshyam
1234467899
ghan654@gmail.com
53
```
- Pattern to extract email id: `\w+@\w+\.\w+`
- Pattern to extract phone number: `\d{10}`
<br>
Task: What will be the pattern to extract all ages from the above corpus?
<br>

Ans:
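`^\d{2}$` (with multiline mode enabled) or `\b\d{2}\b`. A bare `\d{2}` is not enough, because it also matches every pair of digits inside the phone numbers and email ids.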
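
The three extraction patterns can be tried out with a short sketch like the one below. It assumes Python's `re` module and uses only the first two records of the corpus; the variable names are illustrative.

```python
import re

corpus = """Vaibhav
1234567899
vaibhavsaran123@gmail.com
23

Ram
1233567899
ram321@gmail.com
33
"""

emails = re.findall(r"\w+@\w+\.\w+", corpus)
phones = re.findall(r"\d{10}", corpus)

# Word boundaries keep the age pattern from matching digit pairs
# that sit inside phone numbers or email ids.
ages = re.findall(r"\b\d{2}\b", corpus)
naive_ages = re.findall(r"\d{2}", corpus)

print(emails)  # ['vaibhavsaran123@gmail.com', 'ram321@gmail.com']
print(phones)  # ['1234567899', '1233567899']
print(ages)    # ['23', '33']
print(len(naive_ages) > len(ages))  # True
```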

## Uses of Regular Expressions

**Some important usages of regular expressions are:**

- Check whether an input conforms to a given pattern; for example, we can check whether a value entered in an HTML form is a valid e-mail address
> `vaibhavsaran123@gmail.com`

- Look for occurrences of a pattern in a piece of text; for example, check if either the word "color" or the word "colour" appears in a document with just **one scan**
> `I like Red color and i am wearing a Red colour shirt`

- Extract specific portions of a text; for example, extract the postal code of an address
> `Mr John Smith. 132, My Street, Kingston, New York 12401.`

- Replace portions of text; for example, change every occurrence of "color" to "colour"
> `I like Red colour and i am wearing a Red colour shirt`

- Split a larger text into smaller pieces; for example, split a text at every dot, comma, or newline character
> `myself person1,you are person2`

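The five uses listed above map roughly onto five functions of Python's `re` module. The sketch below is one possible illustration, assuming Python and reusing the sample strings from the list; the patterns `colou?r` and `\b\d{5}\b` are assumptions chosen for these particular samples, not general-purpose validators.

```python
import re

# 1. Validate: does the whole input look like an email id?
is_email = re.fullmatch(r"\w+@\w+\.\w+", "vaibhavsaran123@gmail.com") is not None

# 2. Search: find "color" or "colour" in a single scan
sentence = "I like Red color and i am wearing a Red colour shirt"
spellings = re.findall(r"colou?r", sentence)

# 3. Extract: pull a five-digit postal code out of an address
address = "Mr John Smith. 132, My Street, Kingston, New York 12401."
postal_code = re.search(r"\b\d{5}\b", address).group()

# 4. Replace: turn every "color" into "colour"
replaced = re.sub(r"\bcolor\b", "colour", sentence)

# 5. Split: break the text at every dot, comma, or newline
pieces = re.split(r"[.,\n]", "myself person1,you are person2")

print(is_email)     # True
print(spellings)    # ['color', 'colour']
print(postal_code)  # 12401
print(replaced)     # I like Red colour and i am wearing a Red colour shirt
print(pieces)       # ['myself person1', 'you are person2']
```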
# Meta Characters

- All meta characters: `. ^ $ * + ? { } [ ] \ | ( )`

  1. `.` : any character (limitation: except the newline character)
  2. `^` (caret symbol) : starts with `^word`
  3. `$` : ends with `word$`
  4. `*` : zero or more occurrences (limitation: it can end up capturing unnecessary occurrences, including empty matches)
  5. `+` : one or more occurrences
  6. `{}` : exactly the specified number of occurrences `M{2}`
  7. `[]` : a set of characters `[a-c]`
  8. `\` : signals a special sequence `\d` (can also be used to escape special characters)
  9. `|` : either/or `apple|iphone`
  10. `()` : capture and group
  11. `?` : zero or one occurrence, i.e. the preceding item is optional `colou?r`

- `.*` matches zero or more characters, so it succeeds even where nothing occurs (for example, it gives an empty match on a blank line); on a non-empty line it greedily takes the whole line as one match
- `.+` needs at least one character, so it never gives an empty match; it takes each non-empty line as one match

## Data Examples(Self):
- for `.*` and `.+` and `a+`
```
a
aa
aaa
aabbbbbbaa
aaaaa
```
<br>
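
To compare `.*`, `.+` and `a+` on the data above, here is a small sketch assuming Python's `re` module (regex101 may display the empty matches differently). The variable names are illustrative.

```python
import re

data = "a\naa\naaa\naabbbbbbaa\naaaaa"

star = re.findall(r".*", data)    # may contain empty strings
plus = re.findall(r".+", data)    # one match per non-empty line
a_plus = re.findall(r"a+", data)  # only runs of the letter 'a'

print(plus)        # ['a', 'aa', 'aaa', 'aabbbbbbaa', 'aaaaa']
print(a_plus)      # ['a', 'aa', 'aaa', 'aa', 'aa', 'aaaaa']
print("" in star)  # True
print("" in plus)  # False
```

The line `aabbbbbbaa` shows the difference most clearly: `.+` takes it as one match, while `a+` splits it into two separate runs of `aa`.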

- for ^ and $:<br>

```
Virat is a chaser
Rohit is the hitman
Dhoni is the finisher
Veeru was the hitman
```

Pattern: `^Virat` ; `^Virat.+` ; `\w{3}`; `\w.+`; `hitman$` ; `.+hitman$`
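
A minimal sketch of the anchors, assuming Python's `re` module. In Python, `^` and `$` anchor at every line only when the `re.MULTILINE` flag is passed; without it they anchor at the start and end of the whole string.

```python
import re

text = """Virat is a chaser
Rohit is the hitman
Dhoni is the finisher
Veeru was the hitman"""

starts = re.findall(r"^Virat.+", text, flags=re.MULTILINE)
ends = re.findall(r".+hitman$", text, flags=re.MULTILINE)
first_three = re.findall(r"^\w{3}", text, flags=re.MULTILINE)

# Without MULTILINE, $ only matches at the very end of the text.
last_only = re.findall(r".+hitman$", text)

print(starts)       # ['Virat is a chaser']
print(ends)         # ['Rohit is the hitman', 'Veeru was the hitman']
print(first_three)  # ['Vir', 'Roh', 'Dho', 'Vee']
print(last_only)    # ['Veeru was the hitman']
```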
<br>
<br>
- For `{}`:
<br>
```
a
aa
aaa
aabbbbbbaa
aaaaa
aaaaaaa
aaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaa
```
<br>

Pattern: 
`a{3}` ; `a{3,7}` (a minimum and maximum number of occurrences, much like slice bounds) ; `a{3,}` (3 or more occurrences)
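
The counted quantifiers are greedy, which shows up once a run of `a` is longer than the upper bound. A short sketch, assuming Python's `re` module and illustrative variable names:

```python
import re

exactly_three = re.findall(r"a{3}", "a" * 7)       # two full groups, one 'a' left over
three_to_seven = re.findall(r"a{3,7}", "a" * 10)   # takes 7 first, then the remaining 3
three_or_more = re.findall(r"a{3,}", "a" * 10)     # takes the whole run
too_short = re.findall(r"a{3,}", "aabbbbbbaa")     # no run of 3 or more

print(exactly_three)   # ['aaa', 'aaa']
print(three_to_seven)  # ['aaaaaaa', 'aaa']
print(three_or_more)   # ['aaaaaaaaaa']
print(too_short)       # []
```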
<br>
<br>

- For explaining `\` metacharacter:

```
price $750
vaibhav@gmail.com
vaibhav@gmail#com
price 
```

Pattern:  
   1. `\w+ \$\d+`
   2. `price \$750` (a `\` before a metacharacter disables its special meaning, so it is treated as a normal literal)
   3. `vaibhav@gmail.com` (this matches both mail ids, because an unescaped `.` matches any character, including `#`)
   4. `vaibhav@gmail\.com` (the `.` is escaped, so only the real mail id matches) [Alternate: `\w+@\w+\.\w+`]
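
The effect of escaping can be seen side by side in the sketch below, assuming Python's `re` module. `re.escape` is a standard helper that adds the backslashes for you; its use here is an addition for illustration, not part of the original notes.

```python
import re

lines = "price $750\nvaibhav@gmail.com\nvaibhav@gmail#com"

# Unescaped dot: matches any character, so '#' is accepted too.
unescaped = re.findall(r"vaibhav@gmail.com", lines)

# Escaped dot: only a literal '.' is accepted.
escaped = re.findall(r"vaibhav@gmail\.com", lines)

# Escaped dollar: matches a literal '$' instead of "end of line".
price = re.findall(r"\w+ \$\d+", lines)

# re.escape builds the escaped form of a literal string.
escaped_literal = re.escape("$750")

print(unescaped)        # ['vaibhav@gmail.com', 'vaibhav@gmail#com']
print(escaped)          # ['vaibhav@gmail.com']
print(price)            # ['price $750']
print(escaped_literal)  # \$750
```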
<br>
<br>
- For `|`:
```
I am using iphone
I am using apple
```
Pattern: `iphone|apple`
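
A brief sketch of alternation, assuming Python's `re` module; the third test sentence is a made-up example that contains neither word.

```python
import re

text = "I am using iphone\nI am using apple"

# Either alternative counts as a match.
devices = re.findall(r"iphone|apple", text)

# A line with neither word gives no match at all.
no_match = re.search(r"iphone|apple", "I am using android")

print(devices)   # ['iphone', 'apple']
print(no_match)  # None
```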
<br>
<br>
- for `[]`:
```
abcd
ABCD
1234
@#!
price $420c
VaibHAV SaRan
```
Pattern: <br>
    1.  `[a-z]`<br>
    2.  `[a-z]|[A-Z]` or `[a-zA-Z]`<br>
    3.  `[0-9]`<br>
    4.  `[a-zA-Z0-9]` (captures every letter and digit as a separate match)<br>
    5.  `[a-zA-Z0-9]+` (same as above, but each continuous run is one match)<br>
    6.  `[^a-zA-Z0-9\s]+` (captures special symbols; here the caret acts as a "not" symbol, so it reads as: match everything except what is in the square brackets)
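
A few of these character classes applied to the sample lines, as a sketch assuming Python's `re` module:

```python
import re

lower = re.findall(r"[a-z]", "VaibHAV SaRan")              # one match per lowercase letter
upper_runs = re.findall(r"[A-Z]+", "VaibHAV SaRan")        # runs of uppercase letters
alnum_runs = re.findall(r"[a-zA-Z0-9]+", "price $420c")    # runs of letters and digits
symbols = re.findall(r"[^a-zA-Z0-9\s]+", "@#!\nprice $420c")  # negated class

print(lower)       # ['a', 'i', 'b', 'a', 'a', 'n']
print(upper_runs)  # ['V', 'HAV', 'S', 'R']
print(alnum_runs)  # ['price', '420c']
print(symbols)     # ['@#!', '$']
```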

<br><br>

- for `()`: {Basically to capture only a specific part of data}
```
my email id is vaibhav@gmail.com
vaibhav@gmail.com
email vaibhav@gmail.com
email id vaibhav@gmail.com
```
Pattern:<br>
    1. `email id vaibhav@gmail.com` {Will match the entire last line}<br>
    2. `email id (vaibhav@gmail.com)` {Matches the same line, but the group captures only what is in the parentheses}
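
The difference between the whole match and the captured group can be sketched as follows, assuming Python's `re` module. The pattern uses the more general `\w+@\w+\.\w+` inside the parentheses rather than the literal address.

```python
import re

text = """my email id is vaibhav@gmail.com
vaibhav@gmail.com
email vaibhav@gmail.com
email id vaibhav@gmail.com"""

m = re.search(r"email id (\w+@\w+\.\w+)", text)

whole = m.group(0)     # everything the pattern matched
captured = m.group(1)  # only the part inside the parentheses

# When the pattern has one group, findall returns just the group contents.
only_groups = re.findall(r"email id (\w+@\w+\.\w+)", text)

print(whole)        # email id vaibhav@gmail.com
print(captured)     # vaibhav@gmail.com
print(only_groups)  # ['vaibhav@gmail.com']
```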

#### Some examples for set 
 1. `[arn]` Returns a match where one of the specified characters (a, r, or n) is present
 2. `[a-n]` Returns a match for any lower case character, alphabetically between a and n
 3. `[^arn]` Returns a match for any character EXCEPT a, r, and n
 4. `[0123]` Returns a match where any of the specified digits (0, 1, 2, or 3) is present
 5. `[0-9]` Returns a match for any digit between 0 and 9
 6. `[a-zA-Z]` Returns a match for any character alphabetically between a and z, lower case OR upper case

# Special Sequences Meta Characters
- Purpose is to shorten the pattern
- A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning:

  1. `\d` : Matches any decimal digit; this is equivalent to the class [0-9].
  2. `\D` : Matches any non-digit character; this is equivalent to the class [^0-9].
  3. `\s` : Matches any whitespace character, such as a space, a newline (`\n`), or a tab (`\t`).
  4. `\S` : Matches any non-whitespace character (letters, digits, and special characters).
  5. `\w` : Matches any alphanumeric (word) character; this is equivalent to the class [a-zA-Z0-9_] (letters, digits, and the underscore).
  6. `\W` : Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
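
Each special sequence can be checked against a small sample string. The sketch below assumes Python's `re` module; the `re.ASCII` flag is passed because, for Python 3 text patterns, `\d`, `\w` and `\s` otherwise also match non-ASCII digits, letters, and whitespace, so the equivalences with `[0-9]` and `[a-zA-Z0-9_]` hold exactly only in ASCII mode. The string `sample` is made up for illustration.

```python
import re

sample = "Ram_33 \t#1\n"

digits = re.findall(r"\d", sample, flags=re.ASCII)
non_digits = re.findall(r"\D+", sample, flags=re.ASCII)
spaces = re.findall(r"\s", sample, flags=re.ASCII)
non_spaces = re.findall(r"\S+", sample, flags=re.ASCII)
words = re.findall(r"\w+", sample, flags=re.ASCII)
non_words = re.findall(r"\W+", sample, flags=re.ASCII)

print(digits)      # ['3', '3', '1']
print(non_digits)  # ['Ram_', ' \t#', '\n']
print(spaces)      # [' ', '\t', '\n']
print(non_spaces)  # ['Ram_33', '#1']
print(words)       # ['Ram_33', '1']
print(non_words)   # [' \t#', '\n']
```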

# Regex Pattern Examples (Self Notes)
Data:<br>
```
Vaibhav 
9829165412
vaibhavsaran123@gmail.com
23

Ram 
1233567899
ram321@gmail.com
33
```
1. Phone number: `\d{10}` or `[0-9]{10}` 
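2. Extract only valid phone numbers (here, ones whose first digit is 6 to 9): `^[6-9]\d{9}` (needs multiline mode, so that `^` anchors at the start of each line)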
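
A sketch of the two phone-number patterns, assuming Python's `re` module. The trailing `$` is an addition made here so that a longer digit string cannot match partially, and treating a leading 6 to 9 as the validity rule is simply the rule used in these notes.

```python
import re

data = """Vaibhav
9829165412
vaibhavsaran123@gmail.com
23

Ram
1233567899
ram321@gmail.com
33"""

# Any run of exactly ten digits found in the text
any_phone = re.findall(r"\d{10}", data)

# Only ten-digit lines whose first digit is 6-9
valid_phone = re.findall(r"^[6-9]\d{9}$", data, flags=re.MULTILINE)

print(any_phone)    # ['9829165412', '1233567899']
print(valid_phone)  # ['9829165412']
```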
<br>
<br>
Data:<br>
```
vaibhav@gmail.com
vaibhav @gmail.com
vaibhav@ gmail.com

```
1. Extract the first 2 emails: `\w+@\w+\.\w+|\w+\s@\w+\.\w+` or `\w+\s?@\w+\.\w+` (here the `?` after `\s` means that one space may or may not be present)
2. Extract all emails: `\w+\s?@\s?\w+\.\w+`
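
The optional-whitespace patterns can be checked against the three sample lines with a sketch like this one, assuming Python's `re` module:

```python
import re

data = "vaibhav@gmail.com\nvaibhav @gmail.com\nvaibhav@ gmail.com"

# Optional space before the '@' only
first_two = re.findall(r"\w+\s?@\w+\.\w+", data)

# Optional space on either side of the '@'
all_three = re.findall(r"\w+\s?@\s?\w+\.\w+", data)

print(first_two)  # ['vaibhav@gmail.com', 'vaibhav @gmail.com']
print(all_three)  # ['vaibhav@gmail.com', 'vaibhav @gmail.com', 'vaibhav@ gmail.com']
```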