
# Regular Expressions
Useful sites

https://realpython.com/regex-python/

https://regex101.com/

re.search(regex, string): scans a string for the first regex match; returns a match object, or None if there is no match

In [1]:
import re
string = "foo123bar"
re.search('123', string)

<re.Match object; span=(3, 6), match='123'>

In [2]:
if re.search('123', string):
    print("Found a match")
else:
    print("No match")

Found a match
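
A match object carries more than a truthy value. As a minimal sketch (the variable name `m` is only illustrative), the position and text of the match can be read back with `span()`, `start()`, `end()`, and `group()`:

```python
import re

string = "foo123bar"
m = re.search("123", string)

# re.search returns None when nothing matches, so guard before using m
if m is not None:
    print(m.span())                   # (3, 6)
    print(m.group())                  # 123
    print(string[m.start():m.end()])  # 123
```

Slicing the original string with `start()` and `end()` gives the same text as `group()`.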


## Metacharacters matching a single character
[]: character class

In [3]:
string = "foo123bar"
re.search('[0-9][0-9][0-9]', string)

<re.Match object; span=(3, 6), match='123'>

.: matches any single character except a newline

In [4]:
string = "foo123bar"
re.search('1.3', string)

<re.Match object; span=(3, 6), match='123'>

In [5]:
string = "foo123bar"
re.search('.b', string)

<re.Match object; span=(5, 7), match='3b'>

^: Anchors a match at the start of a string

$: Anchors a match at the end of a string

In [6]:
string = "foo123bar"
re.search("^f", string)

<re.Match object; span=(0, 1), match='f'>

In [7]:
string = "foo123bar"
re.search("^b", string)

*: Matches zero or more repetitions of the preceding regex

In [8]:
string = "foo123bar"
re.search("12*", string)

<re.Match object; span=(3, 5), match='12'>
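
The "zero" case is easy to overlook. The sketch below, which uses made-up input strings, shows `2*` matching no characters at all in one case and several in another:

```python
import re

# "2*" may match zero times, so "12*" still matches a bare "1"
m_zero = re.search("12*", "foo13bar")
print(m_zero)   # <re.Match object; span=(3, 4), match='1'>

# It also consumes as many "2" characters as are available
m_many = re.search("12*", "foo1222bar")
print(m_many)   # <re.Match object; span=(3, 7), match='1222'>
```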

[]: match any single character contained in the square brackets

In [9]:
string = "fooo123bar"
re.search("[ard]", string)

<re.Match object; span=(8, 9), match='a'>

In [10]:
string = "foo123bar"
re.search("[a-z]", string) # any single character within the range

<re.Match object; span=(0, 1), match='f'>

In [11]:
string = "foo123bar"
re.search('[0-9a-zA-Z]', string)

<re.Match object; span=(0, 1), match='f'>

In [12]:
re.search('[^0-9]', '12345foo') # ^ as the first character negates the class: matches any character not in the set

<re.Match object; span=(5, 6), match='f'>

\w: matches any word character (letters, digits, and underscore); for ASCII text, same as [a-zA-Z0-9_]

In [13]:
re.search(r'\w', "!?@hellothere")

<re.Match object; span=(3, 4), match='h'>

\W: matches any non-word character

In [14]:
re.search(r'\W', "hello#$?")

<re.Match object; span=(5, 6), match='#'>

\d: matches any decimal digit character, same as [0-9]

\D: matches any non decimal digit character, same as [^0-9]

In [15]:
re.search(r"\d", "hello123")

<re.Match object; span=(5, 6), match='1'>

In [16]:
re.search(r"\D", "hello123")

<re.Match object; span=(0, 1), match='h'>

\s: matches any whitespace character, including newline

\S: matches any non whitespace character

In [17]:
re.search(r"\s", "hello\nthere")

<re.Match object; span=(5, 6), match='\n'>
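
For the complementary class, a minimal sketch with an illustrative input string: `\S` skips leading whitespace and matches the first visible character.

```python
import re

# The string starts with two spaces and a tab, so the first
# non-whitespace character is "h" at index 3
m = re.search(r"\S", "  \thello")
print(m)  # <re.Match object; span=(3, 4), match='h'>
```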

\\: escapes a metacharacter so that it is matched literally

In [18]:
re.search(".", "hello.me") # matches any character

<re.Match object; span=(0, 1), match='h'>

In [19]:
re.search(r"\.", "hello.me") # matches a literal "."

<re.Match object; span=(5, 6), match='.'>

r"string": specify the regex using a raw string, suppresses the escaping at the interpreter level

In [20]:
re.search(r"\\", r"foo\bar")

<re.Match object; span=(3, 4), match='\\'>
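
To see what the `r` prefix changes, the following sketch compares the same literal with and without it (the names `plain` and `raw` are only illustrative). Without the prefix, Python itself converts `\b` into a backspace character before the `re` module ever sees the string:

```python
import re

plain = "foo\bar"   # \b becomes a single backspace character
raw = r"foo\bar"    # backslash and "b" are kept as two characters

print(len(plain))  # 6
print(len(raw))    # 7

# The regex r"\\" (an escaped backslash) only finds a match in the raw version
print(re.search(r"\\", plain))  # None
print(re.search(r"\\", raw))    # <re.Match object; span=(3, 4), match='\\'>
```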

## Anchors
Anchors are zero-width matches. They don’t match any actual characters in the search string, and they don’t consume any of the search string during parsing. Instead, an anchor dictates a particular location in the search string where a match must occur.

^: the parser’s current position must be at the beginning of the search string for it to find a match

In [21]:
re.search("^foo", "foo123bar")

<re.Match object; span=(0, 3), match='foo'>

In [22]:
re.search("^foo", "123foobar")

$: the parser’s current position must be at the end of the search string for it to find a match

In [23]:
re.search("foo$", "foo123bar")

In [24]:
re.search("foo$", "123barfoo")

<re.Match object; span=(6, 9), match='foo'>
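
The zero-width property can be checked directly. In this small sketch, a regex consisting of nothing but an anchor still matches, yet the match is empty:

```python
import re

# Matching only an anchor yields an empty match: no characters are consumed
start = re.search("^", "foo123bar")
end = re.search("$", "foo123bar")

print(start.span(), repr(start.group()))  # (0, 0) ''
print(end.span(), repr(end.group()))      # (9, 9) ''
```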

## Quantifiers
A quantifier metacharacter immediately follows a portion of a regex and indicates how many times that portion must occur for the match to succeed.

+: matches one or more repetitions of the preceding regex

In [25]:
print(re.search("-+", "foobar"))

None


In [26]:
re.search("-+", "foo--bar")

<re.Match object; span=(3, 5), match='--'>

?: matches zero or one repetitions of the preceding regex

In [27]:
re.search("foo-?bar", "foobar")

<re.Match object; span=(0, 6), match='foobar'>

In [28]:
re.search("foo-?bar", "foo--bar")

{m}: matches exactly m repetitions of the preceding regex

In [59]:
re.search("ab{3}", "aaabbbc")

<re.Match object; span=(2, 6), match='abbb'>

{m,n}: matches from m to n repetitions of the preceding regex, inclusive; do not put a space after the comma

In [64]:
re.search("b{3,5}", "aabbbbcc")

<re.Match object; span=(2, 6), match='bbbb'>

In [65]:
re.search("aab{3,5}", "aaabbcccc")

{,}: matches any number of repetitions of the preceding regex, equivalent to *

In [66]:
re.search("aab{,}c", "aabbbbbbbbcv")

<re.Match object; span=(0, 11), match='aabbbbbbbbc'>
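
In the brace form, omitting m defaults it to 0 and omitting n removes the upper bound, so the single-character quantifiers can be written with braces as well. The sketch below (the test string and the `pairs` list are illustrative) checks that each pair finds the same span:

```python
import re

text = "foo--bar"

# Each tuple holds a shorthand quantifier and its brace equivalent
pairs = [("-*", "-{0,}"), ("-+", "-{1,}"), ("o-?", "o-{0,1}")]

results = []
for short, braces in pairs:
    a = re.search(short, text)
    b = re.search(braces, text)
    results.append(a.span() == b.span())
    print(short, braces, a.span() == b.span())
```

Each line printed should end in `True`.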

## Capturing Groups
m.groups(): returns a tuple containing all the captured groups from a regex match.

In [29]:
m = re.search(r"(\w+), (\w+), (\w+)", 'foo, you, bar')
m

<re.Match object; span=(0, 13), match='foo, you, bar'>

In [30]:
m.groups()

('foo', 'you', 'bar')

In [31]:
m.group(2)

'you'

In [32]:
m.group(0) # returns the entire match

'foo, you, bar'

(?:regex): creates a non-capturing group

In [33]:
m = re.search(r"(\w+), (?:\w+), (\w+)", "foo, quux, baz") # does not capture the second word
m.groups()

('foo', 'baz')

## Backreference
\1, \2, ...: matches the contents of the previously captured group with that number

In [34]:
regex = r"(\w+), \1"
m = re.search(regex, "foo, foo")
m

<re.Match object; span=(0, 8), match='foo, foo'>

In [35]:
m = re.search(regex, "foo, you")
print(m)

None
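
Backreferences are numbered like the groups themselves. As a sketch with made-up input strings, this pattern uses two groups and requires them to reappear in reverse order:

```python
import re

# \2 must repeat the second group and \1 the first group
regex = r"(\w+)-(\w+)-\2-\1"

m_ok = re.search(regex, "foo-bar-bar-foo")
m_bad = re.search(regex, "foo-bar-foo-bar")

print(m_ok)   # <re.Match object; span=(0, 15), match='foo-bar-bar-foo'>
print(m_bad)  # None
```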


## Miscellaneous Metacharacters
Miscellaneous: of various types or from different sources.

(?#...): specifies a comment

In [36]:
re.search("bar(?#this is a comment)", "foobar")

<re.Match object; span=(3, 6), match='bar'>

|: specifies a set of alternatives on which to match; alternatives are tried left to right, and the first one that matches is used

In [37]:
re.search("bar|baz|baq", "foobaz")

<re.Match object; span=(3, 6), match='baz'>

In [38]:
re.search('(foo|bar|baz)+', 'foofoofoo')

<re.Match object; span=(0, 9), match='foofoofoo'>
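
The order of the alternatives matters when one is a prefix of another. A minimal sketch (the variable names are illustrative):

```python
import re

# Alternatives are tried left to right; the first one that succeeds wins,
# even if a later alternative would have matched more text
first = re.search("foo|foobar", "foobar")
second = re.search("foobar|foo", "foobar")

print(first)   # <re.Match object; span=(0, 3), match='foo'>
print(second)  # <re.Match object; span=(0, 6), match='foobar'>
```

Listing the longer alternative first is one way to make sure it is preferred.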

## Modified Regular Expression Matching With Flags
re.search(regex, string, flags): scans a string for a regex match, applying the specified modifier flags

re.I: IGNORECASE, case insensitive

In [39]:
re.search('a+', 'aaaAAA', re.I) 

<re.Match object; span=(0, 6), match='aaaAAA'>

re.M: MULTILINE, causes start-of-string and end-of-string anchors to match at embedded newlines

In [40]:
string = "foo\nbar\nbaz"
re.search("^bar", string)

In [41]:
re.search("^bar", string, re.M) 

<re.Match object; span=(4, 7), match='bar'>

In [42]:
re.search("baz$", string, re.M)

<re.Match object; span=(8, 11), match='baz'>
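
The same flag affects `$` in the middle of the string. In this sketch, "bar" is followed by a newline rather than the end of the string, so `bar$` only matches once re.M is set:

```python
import re

string = "foo\nbar\nbaz"

# Without re.M, $ does not match before the embedded newline after "bar"
no_flag = re.search("bar$", string)
# With re.M, $ also matches at the end of each line
with_flag = re.search("bar$", string, re.M)

print(no_flag)    # None
print(with_flag)  # <re.Match object; span=(4, 7), match='bar'>
```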

re.X: VERBOSE, allows inclusion of whitespace and comments within a regex

In [45]:
regex = r'''^               # Start of string
             (\(\d{3}\))?    # Optional area code
             \s*             # Optional whitespace
             \d{3}           # Three-digit prefix
             [-.]            # Separator character
             \d{4}           # Four-digit line number
            $               # Anchor at end of string
            '''
re.search(regex, "(143) 243-5326", re.X)

<re.Match object; span=(0, 14), match='(143) 243-5326'>

In [46]:
re.search(regex, "435.3467", re.X)

<re.Match object; span=(0, 8), match='435.3467'>
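
Because the area code sits inside an optional capturing group, the match object also tells you whether it was present. The sketch below repeats the pattern from above; the names `with_area` and `without_area` are illustrative:

```python
import re

regex = r'''^               # Start of string
            (\(\d{3}\))?    # Optional area code (the only capturing group)
            \s*             # Optional whitespace
            \d{3}           # Three-digit prefix
            [-.]            # Separator character
            \d{4}           # Four-digit line number
            $               # Anchor at end of string
            '''

with_area = re.search(regex, "(143) 243-5326", re.X)
without_area = re.search(regex, "435.3467", re.X)

# A group that did not take part in the match is reported as None
print(with_area.group(1))     # (143)
print(without_area.group(1))  # None
```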

re.DEBUG: shows debugging info

In [47]:
re.search('foo.bar', 'fooxbar', re.DEBUG)

LITERAL 102
LITERAL 111
LITERAL 111
ANY None
LITERAL 98
LITERAL 97
LITERAL 114

 0. INFO 12 0b1 7 7 (to 13)
      prefix_skip 3
      prefix [0x66, 0x6f, 0x6f] ('foo')
      overlap [0, 0, 0]
13: LITERAL 0x66 ('f')
15. LITERAL 0x6f ('o')
17. LITERAL 0x6f ('o')
19. ANY
20. LITERAL 0x62 ('b')
22. LITERAL 0x61 ('a')
24. LITERAL 0x72 ('r')
26. SUCCESS


<re.Match object; span=(0, 7), match='fooxbar'>