# Regex

A regular expression (shortened as regex or regexp) is a sequence of characters that specifies a search pattern in a text.
- It's particularly useful when you have to search for a pattern in a corpus of texts.
- A regex function searches the corpus and returns all matches.
- The corpus can be a document, a phrase, or a collection of documents.

In [22]:
import re

# re.compile() creates a regex object from the pattern you define.
# Inside a character class, '-' goes last so it is a literal hyphen, not a range.
pattern = r'([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+'
email_regex = re.compile(pattern)

# The search() method of the regex object returns None if the pattern is not found.
# If the pattern is found, it returns a Match object for the first occurrence.
# Match objects have a group() method that returns the text that matched.

text1 = "this is not a valid email: @false.com_email"
text2 = "this is a valid email: nome@domain.topleveldomain"

match1 = email_regex.search(text1)
match2 = email_regex.search(text2)

print("match1 is a None object: ", match1 is None)
print("match2 email found: ", match2.group())


match1 is a None object:  True
match2 email found:  nome@domain.topleveldomain
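
Since `search()` stops at the first occurrence, collecting every match in a corpus takes `findall()` or `finditer()`. The following is a minimal sketch; the `corpus` string and the simplified e-mail pattern are made up for illustration.

```python
import re

# Simplified e-mail pattern (illustrative). Non-capturing groups (?:...)
# make findall() return whole matches instead of group contents.
email_regex = re.compile(
    r'(?:[A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+'
)

# Hypothetical corpus containing two addresses.
corpus = "Contact ana.silva@example.com or bob@mail.example.org for details."

# findall() returns every non-overlapping match as a list of strings.
all_emails = email_regex.findall(corpus)
print(all_emails)

# finditer() yields Match objects, which also expose the match position.
spans = [(m.group(), m.start()) for m in email_regex.finditer(corpus)]
print(spans)
```

`findall()` is convenient when only the matched text matters; `finditer()` is the better fit when positions or groups are needed too.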


## Basics

Characters inside square brackets form a disjunction: the pattern matches any single one of the listed characters. For example:

- Pattern: [bB]rasil -> Match: Brasil, brasil
- Pattern: [0123456789] -> Match: any single digit
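
The two patterns above can be tried with `re.findall()`. This is a small sketch with made-up sample strings:

```python
import re

# [bB] accepts either 'b' or 'B' in the first position.
countries = re.findall(r'[bB]rasil', "Brasil, brasil and Brazil")
print(countries)

# A set of digits matches exactly one digit per match.
digits = re.findall(r'[0123456789]', "room 42")
print(digits)
```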

We can also use a '-' to define a range, for example:

- Pattern: [0-9] -> Match: any single digit
- Pattern: [a-z] -> Match: one lowercase letter
- Pattern: [A-Z] -> Match: one uppercase letter
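
A quick check of the three ranges, using an arbitrary sample string:

```python
import re

sample = "Regex 101"  # illustrative input

digits = re.findall(r'[0-9]', sample)   # one digit per match
lowers = re.findall(r'[a-z]', sample)   # one lowercase letter per match
uppers = re.findall(r'[A-Z]', sample)   # one uppercase letter per match

print(digits, lowers, uppers)
```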

We can also use a '^' to negate a disjunction, for example:

- Pattern: [^A-Z] -> Match: any single character that is not an uppercase letter (this includes digits, spaces, and punctuation, so it is broader than [a-z])
- Pattern: [^Aa]  -> Match: any single character that is neither 'A' nor 'a'

It's important to note that negation only happens when '^' is the first character inside the square brackets.

- Pattern: [a^b] -> Match: a single 'a', '^', or 'b'
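
The effect of the caret's position can be checked directly. A minimal sketch with made-up inputs; note that a negated set also matches digits, spaces, and punctuation, and that a caret in any other position is an ordinary character:

```python
import re

# Negated set: everything except uppercase letters.
not_upper = re.findall(r'[^A-Z]', "Ab1 ^")
print(not_upper)

# Negated set: everything except 'A' and 'a'.
not_a = re.findall(r'[^Aa]', "Banana")
print(not_a)

# '^' is not first, so it is a literal member of the set.
literal_caret = re.findall(r'[a^b]', "a^b c")
print(literal_caret)
```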

We can also use '|' to define an OR relationship:

- Pattern: a|b|c -> Match: a, b, or c (same as [abc])
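
For single characters, alternation and a character set give the same result; alternation becomes more useful with longer alternatives. A short sketch with illustrative strings:

```python
import re

with_pipe = re.findall(r'a|b|c', "abcd")
with_set = re.findall(r'[abc]', "abcd")
print(with_pipe, with_set)

# Alternation between whole words, which a character set cannot express.
animals = re.findall(r'cat|dog', "cat, dog, bird")
print(animals)
```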

Other important special characters: '?', '*', '+', '.', '^', '$', '\'

- Pattern: T?he  -> Match: The or he (the character before '?' is optional)
- Pattern: aa*h! -> Match: ah!, aah!, aaaaaaaaaaaah! (the character before '*' occurs 0 or more times)
- Pattern: o+h!  -> Match: oh!, ooh!, ooooooooooooh! (the character before '+' occurs 1 or more times)
- Pattern: M.!   -> Match: Ma!, Mb!, Mc!, ... (any single character except a newline, by default)
- '^' outside square brackets means the beginning of the string, or of each line when the re.MULTILINE flag is set.
- '$' means the end of the string, or of each line when the re.MULTILINE flag is set.
- '\' means that the character to its right is not special. So you can use \. to match a literal period '.' instead of any character.
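
The special characters in this list can be exercised one by one. This is a sketch using made-up sample strings; the `re.MULTILINE` flag is passed explicitly so that '^' and '$' apply to each line rather than only to the whole string.

```python
import re

optional = re.findall(r'T?he', "The he")          # '?': 0 or 1 'T'
star = re.findall(r'aa*h!', "ah! aah! h!")        # '*': 0 or more extra 'a'
plus = re.findall(r'o+h!', "oh! ooh! h!")         # '+': 1 or more 'o'
any_char = re.findall(r'M.!', "Ma! Mb! M!")       # '.': exactly one character
print(optional, star, plus, any_char)

lines = "first line\nsecond line"
starts = re.findall(r'^\w+', lines, re.MULTILINE)  # first word of each line
ends = re.findall(r'\w+$', lines, re.MULTILINE)    # last word of each line
print(starts, ends)

# Escaped period matches only a literal '.'.
periods = re.findall(r'\.', "a.b.c")
print(periods)
```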

- {n} means exactly n occurrences of the previous character or expression.
- {n,m} means from n to m occurrences of the previous character or expression (in Python's re, write it without a space after the comma).
- {n,} means at least n occurrences of the previous character or expression.
- \n a newline
- \t a tab
- \d any digit
- \D any non-digit = [^\d]
- \w any alphanumeric character or underscore = [a-zA-Z0-9_] (for ASCII text)
- \W any character that is not alphanumeric or underscore = [^\w]
- \s any whitespace character (space, tab, newline, ...) = [ \t\n\r\f\v]
- \S any non-whitespace character = [^\s]
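
Counted quantifiers and shorthand classes combine naturally. A minimal sketch with made-up inputs:

```python
import re

# Counted quantifiers applied to \d.
exactly_four = re.findall(r'\d{4}', "2024-01-15")
two_to_four = re.findall(r'\d{2,4}', "2024-01-15")
three_or_more = re.findall(r'\d{3,}', "7 42 123 45678")
print(exactly_four, two_to_four, three_or_more)

# Shorthand classes on a string containing a space, a tab and a newline.
sample = "Room 42\tis_open\n"
words = re.findall(r'\w+', sample)        # runs of letters, digits, underscore
whitespace = re.findall(r'\s', sample)    # each whitespace character
non_digits = re.findall(r'\D+', "ab12cd") # runs of non-digit characters
non_word = re.findall(r'\W', "a-b c!")    # each non-word character
print(words, whitespace, non_digits, non_word)
```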