# Regex

`import re
<str>   = re.sub(<regex>, new, text, count=0)  # Substitutes all occurrences with 'new'.
<list>  = re.findall(<regex>, text)            # Returns all occurrences as strings.
<list>  = re.split(<regex>, text, maxsplit=0)  # Add brackets around regex to keep the matches.
<Match> = re.search(<regex>, text)             # Searches for first occurrence of the pattern.
<Match> = re.match(<regex>, text)              # Searches only at the beginning of the text.
<iter>  = re.finditer(<regex>, text)           # Returns all occurrences as match objects.`

1. Search() and match() return None if they can't find a match.
2. Argument 'flags=re.IGNORECASE' can be used with all functions.
3. Argument 'flags=re.MULTILINE' makes '^' and '$' match the start/end of each line.
4. Argument 'flags=re.DOTALL' makes dot also accept the '\n'.
5. Use r'\1' or '\\1' for backreference.
6. Add '?' after an operator to make it non-greedy.
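
The notes above can be sketched in a few lines. The sample strings are made up for illustration; only standard `re` functions and flags are used.

```python
import re

text = 'Hello hello\nworld'  # hypothetical sample text

# flags=re.IGNORECASE: 'hello' matches regardless of case.
ignore_case = re.findall('hello', text, flags=re.IGNORECASE)

# flags=re.MULTILINE: '^' also matches right after each newline.
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE)

# flags=re.DOTALL: '.' also matches '\n', so a match can span lines.
across_lines = re.search('hello.world', text, flags=re.DOTALL)

# Backreference: r'\2 \1' swaps the text captured by the two groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'first second')

# Non-greedy: '?' after '+' makes it match as little as possible.
greedy = re.search('<.+>', '<a><b>').group()
lazy = re.search('<.+?>', '<a><b>').group()

print(ignore_case, line_starts, repr(across_lines.group()), swapped, greedy, lazy)
```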

## Match Object
`<str>   = <Match>.group()                      # Returns the whole match. Also group(0).
<str>   = <Match>.group(1)                     # Returns part in the first bracket.
<tuple> = <Match>.groups()                     # Returns all bracketed parts.
<int>   = <Match>.start()                      # Returns start index of the match.
<int>   = <Match>.end()                        # Returns exclusive end index of the match.`
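
A short sketch of the Match object methods listed above. The pattern and the sample string are assumptions chosen for illustration.

```python
import re

# Hypothetical sample: capture a name and a number in two groups.
match = re.search(r'(\w+)-(\d+)', 'id: item-42 ok')

whole = match.group()    # The whole match, same as group(0).
first = match.group(1)   # The part captured by the first bracket.
parts = match.groups()   # All bracketed parts as a tuple.
start = match.start()    # Index where the match starts.
end = match.end()        # Index just past the end of the match.

print(whole, first, parts, start, end)
```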

## Special Sequences
1. By default, digits, alphanumerics and whitespace characters from all alphabets are matched, unless the 'flags=re.ASCII' argument is used.
2. Use a capital letter for negation.

`'\d' == '[0-9]'                                # Matches any digit.
'\w' == '[a-zA-Z0-9_]'                         # Matches any alphanumeric.
'\s' == '[ \t\n\r\f\v]'                        # Matches any whitespace.`
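
The effect of 'flags=re.ASCII' and of the capitalized (negated) sequences can be sketched as follows. The sample word, written with an escape for 'é', is an assumption for illustration.

```python
import re

word = 'caf\u00e9'  # hypothetical sample: 'café', ending in a non-ASCII letter

# Default (Unicode) matching: the accented letter counts as a word character.
unicode_match = re.findall(r'\w+', word)

# With re.ASCII, '\w' is limited to [a-zA-Z0-9_], so the accented letter is excluded.
ascii_match = re.findall(r'\w+', word, flags=re.ASCII)

# A capital letter negates: '\D' matches any character that is not a digit.
non_digits = re.findall(r'\D', 'a1b2')

print(unicode_match, ascii_match, non_digits)
```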

## Format

`<str> = f'{<el_1>}, {<el_2>}'
<str> = '{}, {}'.format(<el_1>, <el_2>)`

## General Options

`{<el>:<10}                                     # '<el>      '
{<el>:^10}                                     # '   <el>   '
{<el>:>10}                                     # '      <el>'
{<el>:.<10}                                    # '<el>......'
{<el>:<0}                                      # '<el>'`
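
A minimal sketch of the alignment options above, using a hypothetical three-character element and `repr()` so that the padding is visible.

```python
el = 'abc'  # hypothetical element

left = f'{el:<10}'     # Left-aligned, padded with spaces to width 10.
center = f'{el:^10}'   # Centered within width 10.
right = f'{el:>10}'    # Right-aligned within width 10.
dotted = f'{el:.<10}'  # Left-aligned, padded with dots instead of spaces.
no_pad = f'{el:<0}'    # Width 0 never truncates or pads.

# str.format() accepts the same format specification.
same_as_left = '{:<10}'.format(el)

for value in (left, center, right, dotted, no_pad):
    print(repr(value))
```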

In [6]:
s = 'charan123kumar'
'123' in s

True

In [7]:
s.find('123')

6

In [8]:
s.index('123')

6

### The re Module
Regex functionality in Python resides in a module named re. The re module contains many useful functions and methods.

In [11]:
import re
re.search('123',s)

<re.Match object; span=(6, 9), match='123'>

In [12]:
if re.search('123', s):
    print('Found a match.')
else:
    print('No match.')

Found a match.


In [13]:
re.search('[0-9][0-9][0-9]', s)

<re.Match object; span=(6, 9), match='123'>

In [14]:
re.search('[0-9][0-9][0-9]', 'foo456bar')

<re.Match object; span=(3, 6), match='456'>

In [15]:
re.search('[0-9][0-9][0-9]', '234baz')

<re.Match object; span=(0, 3), match='234'>

In [16]:
re.search('[0-9][0-9][0-9]', 'qux678')

<re.Match object; span=(3, 6), match='678'>

In [17]:
print(re.search('[0-9][0-9][0-9]', '12foo34'))

None


With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find with the in operator or with string methods.

Take a look at another regex metacharacter. The dot (.) metacharacter matches any character except a newline, so it functions like a wildcard:

In [23]:
re.search('1.3', s)

<re.Match object; span=(6, 9), match='123'>

In [24]:
re.search('1..3', s)

| Character(s) | Meaning |
|-----|--------------------------|
| . | Matches any single character except newline |
| ^ | Anchors a match at the start of a string |
| $ | Anchors a match at the end of a string |
| * | Matches zero or more repetitions |
| + | Matches one or more repetitions |
| ? | 1. Matches zero or one repetition. 2. Specifies the non-greedy versions of *, +, and ?. 3. Introduces a lookahead or lookbehind assertion. 4. Creates a named group |
| {} | Matches an explicitly specified number of repetitions |
| \  | 1. Escapes a metacharacter of its special meaning. 2. Introduces a special character class. 3. Introduces a grouping backreference |
| [] | Specifies a character class |
| pipe symbol | Designates alternation |
| () | Creates a group |
| : # = ! | Designate a specialized group |
| <> | Creates a named group |
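
Several rows of the table can be sketched in a few lines. The patterns and sample strings below are assumptions chosen for illustration, not examples from the original notebook.

```python
import re

# {}: an explicit number of repetitions (exactly three digits here).
three_digits = re.search(r'\d{3}', 'ab1234').group()

# Pipe symbol: alternation between two alternatives.
either = re.findall(r'cat|dog', 'cat, dog, cow')

# (): a group, which a quantifier then repeats as a unit.
repeated = re.search(r'(ab)+', 'xababy').group()

# ? with <>: (?P<name>...) creates a named group.
year = re.search(r'(?P<year>\d{4})', 'in 2024').group('year')

# ? with =: (?=...) is a lookahead; 'bar' must follow but is not consumed.
before_bar = re.search(r'foo(?=bar)', 'foobar').group()

print(three_digits, either, repeated, year, before_bar)
```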

In [25]:
re.search('ba[artz]', 'foobarqux') 

<re.Match object; span=(3, 6), match='bar'>

In [26]:
re.search('ba[artz]', 'foobatqux')

<re.Match object; span=(3, 6), match='bat'>

In [27]:
re.search('ba[artz]', 'foobamqux')

In [29]:
re.search('[a-z]', 'FOObar')

<re.Match object; span=(3, 4), match='b'>

In [30]:
re.search('[0-9a-fA-F]', '--- a0 ---')

<re.Match object; span=(4, 5), match='a'>

In [31]:
# [^0-9] matches the first character that is not a digit
re.search('[^0-9]', '1charan')

<re.Match object; span=(1, 2), match='c'>

In [32]:
re.search('[^0-9]', 'a1charan')

<re.Match object; span=(0, 1), match='a'>

In [33]:
re.search('[#:^]', 'foo^bar:baz#qux')

<re.Match object; span=(3, 4), match='^'>

In [36]:
re.search('[#:^]', 'foobarbazqux')

In [41]:
re.search('[-abc]', '123-456')

<re.Match object; span=(3, 4), match='-'>

In [42]:
re.search('[]]', 'foo[1]')

<re.Match object; span=(5, 6), match=']'>

In [43]:
re.search('[)*+|]', '123*456')

<re.Match object; span=(3, 4), match='*'>

In [44]:
re.search('foo.bar', 'fooxbar')

<re.Match object; span=(0, 7), match='fooxbar'>

In [45]:
print(re.search('foo.bar', 'foo\nbar'))
print(re.search('foo.bar', 'foobar'))

None
None


\w matches any alphanumeric word character. Word characters are uppercase and lowercase letters, digits, and the underscore (_) character, so \w is essentially shorthand for [a-zA-Z0-9_]:


In [46]:
re.search(r'\w', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In [47]:
re.search('[a-zA-Z0-9_]', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

\d matches any decimal digit character. \D is the opposite. It matches any character that isn’t a decimal digit:

In [48]:
re.search(r'\d', 'abc4def')

<re.Match object; span=(3, 4), match='4'>

In [54]:
print(re.search(r'\D', '234678'))
re.search(r'\D', '2346Q78')

None


<re.Match object; span=(4, 5), match='Q'>

In [55]:
# \s matches any whitespace character:
re.search(r'\s', 'foo\nbar baz')

<re.Match object; span=(3, 4), match='\n'>

\S is the opposite of \s. It matches any character that isn’t whitespace.

Like \s, \S considers a newline to be whitespace. In a string such as '  \n foo  \n  ', the first non-whitespace character is 'f', so that is the first character \S matches.
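
A minimal sketch of \S skipping leading whitespace, including a newline. The sample string is an assumption chosen so that the first non-whitespace character is 'f'.

```python
import re

text = '  \n foo  \n  '  # hypothetical sample: whitespace and newlines around 'foo'

# \S skips the two spaces, the newline and the following space.
first_non_space = re.search(r'\S', text)

print(first_non_space.span(), first_non_space.group())
```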

The character class sequences \w, \W, \d, \D, \s, and \S can appear inside a square bracket character class as well. For example, [\d\w\s] matches any digit, word, or whitespace character.
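
A short sketch of special sequences used inside square brackets. The class `[\d\w\s]` and the sample strings are assumptions for illustration; since \w already covers digits, \d is included only to show that several sequences can be combined.

```python
import re

pattern = r'[\d\w\s]'  # any digit, word, or whitespace character

digit = re.search(pattern, '---3---')   # finds the digit
letter = re.search(pattern, '---a---')  # finds the letter
space = re.search(pattern, '--- ---')   # finds the space
nothing = re.search(pattern, '-------') # dashes are in none of the classes

print(digit.group(), letter.group(), repr(space.group()), nothing)
```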
    

When the backslash isn’t introducing a special character class or a backreference, it escapes metacharacters. A metacharacter preceded by a backslash loses its special meaning and matches the literal character instead. Consider the following examples:

In [56]:
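re.search('.', 'foo.bar')    # Unescaped dot: matches the first character, 'f' (result not displayed)
re.search(r'\.', 'foo.bar')  # Escaped dot: matches only the literal '.'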

<re.Match object; span=(3, 4), match='.'>

In [57]:
# \A anchors the match at the start of the string, like ^:
re.search(r'\Afoo', 'foobar')

<re.Match object; span=(0, 3), match='foo'>

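When the regex parser encounters `$` or `\Z`, the parser’s current position must be at the end of the search string for it to find a match. Whatever precedes $ or \Z must constitute the end of the search string. As a special case, $ (but not \Z) also matches just before a single newline at the end of the search string: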

In [58]:
re.search('bar$', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [59]:
re.search(r'bar\Z', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [60]:
re.search('bar$', 'foobar\n')

<re.Match object; span=(3, 6), match='bar'>
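
The match on 'foobar\n' works because `$` also matches just before a single trailing newline, whereas `\Z` matches only at the very end of the string. A minimal sketch of the difference, reusing the string from the cell above:

```python
import re

text = 'foobar\n'  # search string with a trailing newline

# '$' matches at the end of the string or just before a trailing newline.
with_dollar = re.search(r'bar$', text)

# '\Z' matches only at the very end, which here comes after the newline.
with_z = re.search(r'bar\Z', text)

print(with_dollar)
print(with_z)
```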

`\b` asserts that the regex parser’s current position must be at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores ([a-zA-Z0-9_]), the same as for the \w character class:

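re.search(r'\bbar', 'foo bar')        # Matches: 'bar' starts a word after the space
re.search(r'\bbar', 'foo.bar')        # Matches: '.' is not a word character
print(re.search(r'\bbar', 'foobar'))  # None: 'bar' sits inside the word 'foobar'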

In [62]:
re.search(r'bar\b', 'foo bar')
re.search(r'bar\b', 'foo.bar')
print(re.search(r'bar\b', 'barfoo'))  # 'bar' is followed by a word character, so there is no word boundary

None


In [63]:
print(re.search(r'\bbar\b', 'foo bar baz'))
print(re.search(r'\bbar\b', 'foo(bar)baz'))
print(re.search(r'\bbar\b', 'foobarbaz'))

<re.Match object; span=(4, 7), match='bar'>
<re.Match object; span=(4, 7), match='bar'>
None


`\B` does the opposite of `\b`. It asserts that the regex parser’s current position must not be at the start or end of a word:

In [65]:
print(re.search(r'\Bfoo\B', 'foo'))
print(re.search(r'\Bfoo\B', '.foo.'))
re.search(r'\Bfoo\B', 'barfoobaz')

None
None


<re.Match object; span=(3, 6), match='foo'>

### Quantifiers
A quantifier metacharacter immediately follows a portion of a <regex> and indicates how many times that portion must occur for the match to succeed.

`*`

Matches zero or more repetitions of the preceding regex.

For example, a* matches zero or more 'a' characters. That means it would match an empty string, 'a', 'aa', 'aaa', and so on.

In [67]:
re.search('foo-*bar', 'foobar')

<re.Match object; span=(0, 6), match='foobar'>

In [68]:
re.search('foo-*bar', 'foo-bar')   

<re.Match object; span=(0, 7), match='foo-bar'>

In [69]:
re.search('foo-*bar', 'foo--bar')           

<re.Match object; span=(0, 8), match='foo--bar'>

In [70]:
re.search('foo.*bar', '# foo $qux@grault % bar #')

<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>

In [73]:
re.search('foo.*bar', '# foo $qux@grault % br #')

`+`

Matches one or more repetitions of the preceding regex.

This is similar to `*`, but the quantified regex must occur at least once:

In [75]:
print(re.search('foo-+bar', 'foobar'))    # Zero dashes
print(re.search('foo-+bar', 'foo-bar'))   # One dash
print(re.search('foo-+bar', 'foo--bar'))  # Two dashes

None
<re.Match object; span=(0, 7), match='foo-bar'>
<re.Match object; span=(0, 8), match='foo--bar'>
