# Regular Expression

## Specify Pattern Using RegEx

### MetaCharacters
Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

- [] . ^ $ * + ? {} () \ |

#### [] - Square brackets

- Square brackets specify a set of characters you wish to match.
    - [a-e] is the same as [abcde].
    - [1-4] is the same as [1234].
    - [0-39] is the same as [01239].
    - [^abc] means any character except a or b or c.
    - [^0-9] means any non-digit character.   
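
The character sets listed above can be tried out with a short sketch like the one below. The sample text and variable names are made up for illustration.

```python
import re

text = "a1b2c3 x9"  # hypothetical sample text

# [a-c] matches one character in the range a to c
letters = re.findall(r"[a-c]", text)
# [^0-9] matches any character that is not a digit (the space counts too)
non_digits = re.findall(r"[^0-9]", text)
# [0-39] is the range 0-3 plus the literal 9
some_digits = re.findall(r"[0-39]", text)

print(letters)      # ['a', 'b', 'c']
print(non_digits)   # ['a', 'b', 'c', ' ', 'x']
print(some_digits)  # ['1', '2', '3', '9']
```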

#### . - Period

- A period matches any single character (except newline '\n').
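
As a small illustration of the period, the sketch below uses a made-up sample string. It also shows the `re.DOTALL` flag, which is not covered above and is included here only to contrast with the default behaviour.

```python
import re

text = "cat cot c\nt cut"  # hypothetical sample text

# "c.t" needs exactly one character between c and t;
# by default the newline in "c\nt" does not qualify
dot_matches = re.findall(r"c.t", text)
print(dot_matches)  # ['cat', 'cot', 'cut']

# With the re.DOTALL flag, the period also matches a newline
dotall_matches = re.findall(r"c.t", text, flags=re.DOTALL)
print(dotall_matches)  # ['cat', 'cot', 'c\nt', 'cut']
```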

#### ^ - Caret

- The caret symbol ^ anchors a match to the start of the string: ^a matches only if the string starts with a.

#### $ - Dollar

- The dollar symbol $ anchors a match to the end of the string: a$ matches only if the string ends with a.
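
The caret and dollar anchors can be checked with a quick sketch like this one. The sample string and variable names are illustrative assumptions.

```python
import re

text = "Hello world"  # hypothetical sample text

# ^ ties the pattern to the start of the string
starts_with_hello = re.search(r"^Hello", text) is not None
starts_with_world = re.search(r"^world", text) is not None

# $ ties the pattern to the end of the string
ends_with_world = re.search(r"world$", text) is not None
ends_with_hello = re.search(r"Hello$", text) is not None

print(starts_with_hello, starts_with_world)  # True False
print(ends_with_world, ends_with_hello)      # True False
```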

#### * - Star

- The star symbol * matches zero or more occurrences of the pattern to its left.

#### + - Plus

- The plus symbol + matches one or more occurrences of the pattern to its left.

#### ? - Question Mark

- The question mark symbol ? matches zero or one occurrence of the pattern to its left.

#### {} - Braces

- {n,m} matches at least n and at most m repetitions of the pattern to its left.
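
To compare the four quantifiers (*, +, ?, and {n,m}) side by side, here is a minimal sketch. The word list and the helper `matching` are hypothetical, and `re.fullmatch` is used so that each word must match the pattern in its entirety.

```python
import re

words = ["mn", "man", "maan", "maaan", "maaaan"]  # hypothetical test words

def matching(pattern):
    """Return the words that match the pattern from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

star = matching(r"ma*n")        # zero or more a
plus = matching(r"ma+n")        # one or more a
optional = matching(r"ma?n")    # zero or one a
braces = matching(r"ma{2,3}n")  # two or three a

print(star)      # ['mn', 'man', 'maan', 'maaan', 'maaaan']
print(plus)      # ['man', 'maan', 'maaan', 'maaaan']
print(optional)  # ['mn', 'man']
print(braces)    # ['maan', 'maaan']
```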

#### | - Alternation

- Vertical bar | is used for alternation (or operator).

#### () - Group

- Parentheses () are used to group sub-patterns.
- For example, (a|b|c)xz matches any string that contains a, b, or c followed by xz.
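
The following sketch shows how grouping changes what the alternation applies to. The sample strings are made up, and note that when a pattern contains one group, `re.findall` returns the text captured by the group rather than the whole match.

```python
import re

grouped = r"(a|b|c)xz"  # a, b, or c, followed by xz
ungrouped = r"a|b|cxz"  # a, or b, or the sequence cxz

no_match = re.search(grouped, "ab xz")  # nothing sits directly before xz
match = re.search(grouped, "abxz")

print(no_match)        # None
print(match.group())   # bxz  (the whole match)
print(match.group(1))  # b    (the text captured by the parentheses)

# With one group in the pattern, findall returns the group contents
captured = re.findall(grouped, "axz cabxz")
print(captured)  # ['a', 'b']

# Without parentheses, | splits the entire pattern into alternatives
alternatives = re.findall(ungrouped, "abcxz")
print(alternatives)  # ['a', 'b', 'cxz']
```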

#### \ - Backslash

- Backslash \ is used to escape various characters, including all metacharacters, so that they are matched literally.
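
A short sketch of escaping, using a made-up price string. It also uses `re.escape`, a standard-library helper not mentioned above, to build a pattern that matches a given text literally.

```python
import re

# An unescaped period matches any character; an escaped one matches only "."
any_char = re.findall(r".", "a.b")
literal_dot = re.findall(r"\.", "a.b")
print(any_char)     # ['a', '.', 'b']
print(literal_dot)  # ['.']

# $ and . are both metacharacters, so both are escaped here
price = "Total: $5.00 (approx.)"  # hypothetical sample text
amounts = re.findall(r"\$\d+\.\d+", price)
print(amounts)  # ['$5.00']

# re.escape adds the backslashes needed to match a text literally
literal = "1+1=2"
escaped_ok = re.fullmatch(re.escape(literal), literal) is not None
print(escaped_ok)  # True
```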

#### Special Sequences

- \A : Matches if the specified characters are at the start of a string.
- \b : Matches if the specified characters are at the beginning or end of a word.
- \B : Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.
- \d : Matches any decimal digit. For ASCII text, equivalent to [0-9].
- \D : Matches any character that is not a decimal digit. For ASCII text, equivalent to [^0-9].
- \s : Matches any whitespace character. For ASCII text, equivalent to [ \t\n\r\f\v].
- \S : Matches any non-whitespace character. For ASCII text, equivalent to [^ \t\n\r\f\v].
- \w : Matches any word character (letters, digits, and the underscore). For ASCII text, equivalent to [a-zA-Z0-9_].
- \W : Matches any non-word character. For ASCII text, equivalent to [^a-zA-Z0-9_].
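
The special sequences above can be explored with a sketch along these lines. The sample strings and variable names are illustrative only.

```python
import re

text = "Room 42, floor_3 is open"  # hypothetical sample text

digits = re.findall(r"\d+", text)      # runs of digits
non_digits = re.findall(r"\D+", "a1b22c")  # runs of non-digits
words = re.findall(r"\w+", text)       # runs of word characters (underscore included)
spaces = re.findall(r"\s", text)       # individual whitespace characters

print(digits)       # ['42', '3']
print(non_digits)   # ['a', 'b', 'c']
print(words)        # ['Room', '42', 'floor_3', 'is', 'open']
print(len(spaces))  # 4

# \b requires a word boundary, so only the standalone "is" matches
whole_word = re.findall(r"\bis\b", "this is it")
print(whole_word)  # ['is']

# \B requires the absence of a boundary, so the "is" inside "this" matches
inside_word = re.search(r"\Bis", "this is it")
print(inside_word.start())  # 2

# \A matches only at the very start of the string
at_start = re.search(r"\AThe", "The end") is not None
print(at_start)  # True
```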

# Regular Expressions in Python

In [1]:
import re

## Using r prefix before RegEx

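- An r or R prefix before a string literal makes it a raw string: backslashes are kept as literal characters instead of starting escape sequences, so r'\n' is a backslash followed by n, not a newline.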

In [2]:
s = 'Hi\nHello'
raw_s = r'Hi\nHello'

In [3]:
print(s)
print(raw_s)

Hi
Hello
Hi\nHello


In [4]:
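string = '\nection\n\\n'  # two newlines, then a literal backslash followed by n
print(string)

# '\n' is a real newline character, so this finds the two newlines
result = re.findall('\n', string)
# To find the literal backslash followed by n instead, escape the backslash:
#result = re.findall('\\\\n', string)
#result = re.findall(r'\\n', string)

print(result)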


ection
\n
['\n', '\n']


## re.findall()
- The **re.findall()** method returns a list of strings containing all matches.
- See also **re.search()**, which returns a match object for the first match only, or None if there is no match.
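
To contrast the two functions, here is a minimal sketch with a made-up sample string: `re.search` stops at the first match, while `re.findall` collects all of them.

```python
import re

string = "Order 66 shipped on day 7"  # hypothetical sample text

# re.search returns a match object for the first match only
first = re.search(r"\d+", string)
print(first.group())  # 66

# re.findall returns every match as a list of strings
all_numbers = re.findall(r"\d+", string)
print(all_numbers)  # ['66', '7']

# When nothing matches, re.search returns None and re.findall an empty list
missing = re.search(r"\d+", "no digits here")
print(missing)  # None
print(re.findall(r"\d+", "no digits here"))  # []
```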

In [5]:
string = 'I like this 34a class. Very much 64. 2389, but love another class with 29039812.3882199slie.c222'
pattern = r'\d+'  # raw string, so \d reaches the regex engine unchanged
result = re.findall(pattern, string) 
print(result)

['34', '64', '2389', '29039812', '3882199', '222']


In [8]:
string = 'hello 12 hi 89. Howdy 34, +'
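# Inside [], + is a literal plus sign: this class matches one digit, '+', ',', or whitespace character
result = re.findall(r'[\d+, \s+]', string)
result2 = re.findall(r'\s', string)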

print(result)
print(result2)

[' ', '1', '2', ' ', ' ', '8', '9', ' ', ' ', '3', '4', ',', ' ', '+']
[' ', ' ', ' ', ' ', ' ', ' ']


## re.split()
- The **re.split()** method splits the string at every match and returns a list of the pieces between the matches.

In [9]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'
result = re.split(pattern, string) 
print(result)

['Twelve:', ' Eighty nine:', '.']


## re.sub()
re.sub(pattern, replace, string)
<br>
- Returns a string in which every match of pattern is replaced with the content of the replace variable.

In [11]:
string = 'abc 12\
de 23 \n f45 6'
print("current:"+string)
print("\n")

pattern = r'\s+' # matches one or more whitespace characters
replace = 'aaaaaaaaa' # replacement text
new_string = re.sub(pattern, replace, string) 
print("new:"+new_string)


current:abc 12de 23 
 f45 6


new:abcaaaaaaaaa12deaaaaaaaaa23aaaaaaaaaf45aaaaaaaaa6


In [12]:
# the backslash at the end of the first line joins the two source lines; the only newline is the \n
string = 'abc 12\
de 23 \n f45 6'
print("current:", string)

# matches one or more whitespace characters
pattern = r'\s+'
replace = ''
new_string = re.sub(pattern, replace, string, count=2)  # replace only the first 2 matches
print("new:"+new_string)

current: abc 12de 23 
 f45 6
new:abc12de23 
 f45 6


## Match object
- You can list the methods and attributes of a match object with the dir() function.

- Some of the commonly used methods and attributes of match objects are:

### match.group()
- The **group()** method returns the part of the string where there is a match.

### match.start(), match.end() and match.span()
- The **start()** function returns the index of the start of the matched substring. 
- Similarly, **end()** returns the index just past the end of the matched substring.
- The **span()** function returns a tuple containing start and end index of the matched part.

### match.re and match.string
- The **re** attribute of a match object returns the regular expression object that produced the match.
- Similarly, the **string** attribute returns the string that was passed to the matching function.
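
The match object methods and attributes described above can be seen together in a sketch like this one. The sample string and pattern are made up for illustration.

```python
import re

string = "Call 555 12 now"       # hypothetical sample text
pattern = r"(\d{3}) (\d{2})"     # three digits, a space, two digits

match = re.search(pattern, string)

print(match.group())   # 555 12  (the whole match)
print(match.group(1))  # 555     (first parenthesized group)
print(match.group(2))  # 12      (second parenthesized group)

print(match.start())   # 5
print(match.end())     # 11  (index just past the last matched character)
print(match.span())    # (5, 11)

print(match.re.pattern)  # (\d{3}) (\d{2})
print(match.string)      # Call 555 12 now
```

Slicing the original string with the span, as in `string[5:11]`, gives back the same text as `match.group()`.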