#### Regex Basics

Summary: 
- **Regular expressions** are special sequences of characters that describe a pattern of text that is to be matched
- We can use literals to match the exact characters that we desire
- **Alternation**, using the pipe symbol |, allows us to match the text preceding or following the |
- **Character sets**, denoted by a pair of brackets [], let us match one character from a series of characters
- **Wildcards**, represented by the period or dot ., will match any single character (letter, number, symbol or whitespace), except a newline by default
- **Ranges** allow us to specify a range of characters in which we can make a match
- **Shorthand character classes** like \w, \d and \s stand for the ranges of word characters, digit characters, and whitespace characters, respectively
- **Groupings**, denoted with parentheses (), group parts of a regular expression together, and allow us to limit alternation to part of a regex
- **Fixed quantifiers**, represented with curly braces {}, let us indicate the exact quantity or a range of quantity of a character we wish to match
- **Optional quantifiers**, indicated by the question mark ?, allow us to indicate a character in a regex is optional, or can appear either 0 times or 1 time
- The **Kleene star**, denoted with the asterisk *, is a quantifier that matches the preceding character 0 or more times
- The **Kleene plus**, denoted by the plus +, matches the preceding character 1 or more times
- The **anchor** symbols hat ^ and dollar sign $ are used to match text at the start and end of a string, respectively

Key Metacharacters:
- . : Matches any single character except a newline.
- ^ : Matches the start of a string.
- $ : Matches the end of a string.
- "*" : Matches zero or more occurrences of the preceding character.
- "+" : Matches one or more occurrences of the preceding character.
- ? : Matches zero or one occurrence of the preceding character.
- [] : A character class. Matches any single character listed inside the square brackets.
- () : Groups patterns together.
- | : Acts like an OR operator, allowing multiple patterns to be matched.

#### Key Aspects

**Literals** \
A literal regular expression contains the exact text that we want to match.
- The regex a, for example, will match the text a, and the regex bananas will match the text bananas.
- The regex 3 will match the 3 in the piece of text 34.

Literals can also match part of the text within a longer string.
- When the regex engine finds a character that matches the first piece of the expression, it checks whether the characters that follow form a continuous sequence matching the rest of the expression.
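
The literal examples above can be tried with Python's built-in re module. This is a minimal sketch; the sample sentence and variable names are illustrative assumptions, not part of the original notes.

```python
import re

# A literal pattern matches exactly the characters it spells out.
digit = re.search(r"3", "34")

# A literal can also match in the middle of a longer string.
word = re.search(r"bananas", "I like bananas a lot")

print(digit.group(), digit.span())  # matched text and (start, end) position
print(word.group(), word.span())
```

The span() values show where in the string the continuous sequence of matching characters was found.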

**Alternation** \
Alternation, performed with the pipe symbol |, allows us to match either the characters preceding the | or the characters following it.
- The regex baboons|gorillas will match baboons in the text I love baboons, but will also match gorillas in the text I love gorillas.
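
A small sketch of alternation in Python, assuming the built-in re module. The third sentence is an added example to show what happens when neither alternative is present.

```python
import re

pattern = re.compile(r"baboons|gorillas")
texts = ["I love baboons", "I love gorillas", "I love lemurs"]

results = []
for text in texts:
    match = pattern.search(text)
    # Store the matched alternative, or None when neither one appears.
    results.append(match.group() if match else None)

print(results)
```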

**Character sets** \
Denoted by a pair of brackets [], character sets let us match one character from a series of characters, allowing for matches with incorrect or different spellings.
- The regex con[sc]en[sc]us will match consensus, the correct spelling of the word, but also match the following three incorrect spellings: concensus, consencus, and concencus
- The regex [cat] will match a single character c, a, or t, but not the text cat as a whole.
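
The spelling example can be checked as follows. This is a sketch assuming Python's re module; the extra spelling conzensus is an invented counterexample.

```python
import re

pattern = re.compile(r"con[sc]en[sc]us")
spellings = ["consensus", "concensus", "consencus", "concencus", "conzensus"]

# fullmatch() requires the whole string to match the pattern.
accepted = [s for s in spellings if pattern.fullmatch(s)]
print(accepted)

# [cat] matches one character at a time, so "cat" yields three separate matches.
single_chars = re.findall(r"[cat]", "cat")
print(single_chars)
```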

**Wildcards**\
The wildcard . will match any single character (letter, number, symbol or whitespace) in a piece of text, except a newline by default.
- Let’s say we want to match any 9-character piece of text. The regex ......... will completely match orangutan and marsupial
- I ate . bananas will completely match both I ate 3 bananas and I ate 8 bananas

If we want to match an actual period, we can use the escape character \ to turn off the wildcard functionality.
- The regex Howler monkeys are really lazy\. will completely match the text Howler monkeys are really lazy.
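
A brief sketch of the wildcard and its escaped form, assuming Python's re module. The shortened pattern lazy. and the word monkey are illustrative additions.

```python
import re

words = ["orangutan", "marsupial", "monkey"]
# Nine dots match any text that is exactly nine characters long.
nine_chars = [w for w in words if re.fullmatch(r".........", w)]
print(nine_chars)

# An unescaped dot accepts any character; an escaped dot accepts only a period.
unescaped_accepts_bang = re.fullmatch(r"lazy.", "lazy!") is not None
escaped_accepts_bang = re.fullmatch(r"lazy\.", "lazy!") is not None
escaped_accepts_dot = re.fullmatch(r"lazy\.", "lazy.") is not None
print(unescaped_accepts_bang, escaped_accepts_bang, escaped_accepts_dot)

# By default the dot does not match a newline character.
dot_matches_newline = re.fullmatch(r"a.b", "a\nb") is not None
print(dot_matches_newline)
```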

**Ranges**\
Ranges allow us to specify a range of characters in which we can make a match without having to type out each individual character.
- The regex [abc], which matches any one of the characters a, b, or c, is equivalent to the range [a-c]
- We can match any single capital letter with the regex [A-Z], any lowercase letter with [a-z], and any digit with [0-9]
- To match any single capital or lowercase alphabetical character, we can use the regex [A-Za-z]
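
The ranges above can be compared side by side. This is a sketch assuming Python's re module, and the sample text is made up for illustration.

```python
import re

text = "Zoo Exhibit 7b"

capitals = re.findall(r"[A-Z]", text)    # uppercase letters only
lowers = re.findall(r"[a-z]", text)      # lowercase letters only
digits = re.findall(r"[0-9]", text)      # digits only
letters = re.findall(r"[A-Za-z]", text)  # two ranges combined in one set

print(capitals, digits)
print("".join(lowers), "".join(letters))
```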

**Anchors**\
The hat ^ and dollar sign $ are used to match text at the start and the end of a string, respectively.
- ^Monkeys: my mortal enemy$ will completely match the text Monkeys: my mortal enemy but not match Spider Monkeys: my mortal enemy in the wild
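
A sketch of the anchor example, assuming Python's re module. The unanchored pattern is added only for contrast.

```python
import re

anchored = re.compile(r"^Monkeys: my mortal enemy$")
unanchored = re.compile(r"Monkeys: my mortal enemy")

exact = "Monkeys: my mortal enemy"
longer = "Spider Monkeys: my mortal enemy in the wild"

# With anchors, the pattern must span the string from start to end.
anchored_exact = anchored.search(exact) is not None
anchored_longer = anchored.search(longer) is not None

# Without anchors, the same text is found inside the longer string.
unanchored_longer = unanchored.search(longer) is not None

print(anchored_exact, anchored_longer, unanchored_longer)
```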

##### Shorthand Character Classes
Shorthand character classes represent common ranges, and they make writing regular expressions much simpler.
- \w: the “word character” class represents the regex range [A-Za-z0-9_], and it matches a single uppercase character, lowercase character, digit, or underscore
- \d: the “digit character” class represents the regex range [0-9], and it matches a single digit character
- \s: the “whitespace character” class represents the regex range [ \t\r\n\f\v], matching a single space, tab, carriage return, line break, form feed, or vertical tab
- \W: the “non-word character” class represents the regex range [^A-Za-z0-9_], matching any character that is not included in the range represented by \w
- \D: the “non-digit character” class represents the regex range [^0-9], matching any character that is not included in the range represented by \d
- \S: the “non-whitespace character” class represents the regex range [^ \t\r\n\f\v], matching any character that is not included in the range represented by \s

Examples: 
- the regex \d\s\w\w\w\w\w\w\w matches a digit character, followed by a whitespace character, followed by 7 word characters. Thus the regex completely matches the text 3 monkeys.
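
The shorthand classes can be exercised as in the sketch below, which assumes Python's re module. One caveat for Python 3: with string patterns, \w, \d and \s are Unicode-aware by default, so they match more than the ASCII ranges listed above; the re.ASCII flag restricts them to those ranges. The sample strings are illustrative.

```python
import re

# One digit, one whitespace character, then seven word characters.
full = re.fullmatch(r"\d\s\w\w\w\w\w\w\w", "3 monkeys")
print(full.group())

# The uppercase classes match whatever the lowercase ones do not.
non_word = re.findall(r"\W", "3 monkeys!")
non_digit = re.findall(r"\D", "a1b2")
print(non_word, non_digit)

# An accented letter counts as a word character unless re.ASCII is set.
accented = "\u00e9"
unicode_word = re.fullmatch(r"\w", accented) is not None
ascii_word = re.fullmatch(r"\w", accented, flags=re.ASCII) is not None
print(unicode_word, ascii_word)
```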

##### Grouping & Quantifiers 

**Grouping**\
Grouping, denoted with the open parenthesis ( and the closing parenthesis ), lets us group parts of a regular expression together and limit alternation to part of the regex.

The regex I love (baboons|gorillas) will match the text I love and then match either baboons or gorillas, as the grouping limits the reach of the | to the text within the parentheses.
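
To see how grouping limits the reach of the |, the sketch below compares the grouped regex with an ungrouped variant. It assumes Python's re module; the ungrouped pattern is added for contrast and is not from the notes above.

```python
import re

grouped = re.compile(r"I love (baboons|gorillas)")
ungrouped = re.compile(r"I love baboons|gorillas")

sentence = "I love gorillas"

# Grouped: "I love " is required, followed by one of the two animals.
grouped_match = grouped.fullmatch(sentence)
print(grouped_match.group(1))

# Ungrouped: the alternatives are "I love baboons" and "gorillas" on its own,
# so the full sentence does not match, but the bare word does.
ungrouped_sentence = ungrouped.fullmatch(sentence) is not None
ungrouped_word = ungrouped.fullmatch("gorillas") is not None
print(ungrouped_sentence, ungrouped_word)
```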

**Quantifiers**\
Fixed quantifiers, denoted with curly braces {}, let us indicate the exact quantity of a character we wish to match, or provide a quantity range to match on.
- \w{3} will match exactly 3 word characters
- \w{4,7} will match at minimum 4 word characters and at maximum 7 word characters
- The regex roa{3}r will match the characters ro followed by 3 as, and then the character r, such as in the text roaaar
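
A sketch of fixed quantifiers, assuming Python's re module. The word lists are invented to show strings that fall inside and outside each quantity.

```python
import re

roars = ["roar", "roaar", "roaaar", "roaaaar"]
# Only the spelling with exactly three a's matches roa{3}r.
exact_three = [r for r in roars if re.fullmatch(r"roa{3}r", r)]
print(exact_three)

animals = ["ape", "lemur", "gorilla", "orangutan"]
# \w{4,7} accepts between 4 and 7 word characters, inclusive.
four_to_seven = [a for a in animals if re.fullmatch(r"\w{4,7}", a)]
print(four_to_seven)
```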

**Optional quantifiers** \
Optional quantifiers, indicated by the question mark ?, allow us to indicate that a character in a regex is optional, meaning it can appear either 0 times or 1 time.
- The regex humou?r matches the characters humo, then either 0 occurrences or 1 occurrence of the letter u, and finally the letter r
- The regex The monkey ate a (rotten )?banana will completely match both The monkey ate a rotten banana and The monkey ate a banana.
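
Both optional-quantifier examples can be checked with the sketch below, which assumes Python's re module. The spelling humouur is an added counterexample.

```python
import re

spellings = ["humor", "humour", "humouur"]
# u? allows zero or one u, so the double-u spelling is rejected.
accepted = [s for s in spellings if re.fullmatch(r"humou?r", s)]
print(accepted)

banana = re.compile(r"The monkey ate a (rotten )?banana")
with_rotten = banana.fullmatch("The monkey ate a rotten banana")
without_rotten = banana.fullmatch("The monkey ate a banana")

# When the optional group is skipped, group(1) is None.
print(repr(with_rotten.group(1)), without_rotten.group(1))
```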

**Quantifiers - 0 or More, 1 or More** \
The *Kleene star*, denoted with the asterisk *, is also a quantifier; it matches the preceding character 0 or more times. The character doesn’t need to appear, can appear once, or can appear many times.
- The regex meo*w will match the characters me, followed by 0 or more os, followed by a w.

The *Kleene plus*, denoted by the plus +, matches the preceding character 1 or more times.
- The regex meo+w will match the characters me, followed by 1 or more os, followed by a w
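
The difference between the star and the plus shows up when the o is missing entirely. A minimal sketch, assuming Python's re module and an illustrative word list:

```python
import re

sounds = ["mew", "meow", "meooow"]

# * allows zero o's, so "mew" is accepted.
star = [s for s in sounds if re.fullmatch(r"meo*w", s)]

# + requires at least one o, so "mew" is rejected.
plus = [s for s in sounds if re.fullmatch(r"meo+w", s)]

print(star)
print(plus)
```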


In [1]:
## import module for regex
import re

### Code Examples

##### Matching Patterns 
The re.search() function finds the first match of a pattern in a string.

In [2]:
text = "A wild cat appeared!"
match = re.search(r'cat', text)
if match:
    print("Found:", match.group())
else:
    print("Not found.")

Found: cat


##### Character Classes and Quantifiers:
Character classes and quantifiers are used to match specific sets of characters and control the number of occurrences.

re.findall() returns all non-overlapping matches of a pattern as a list.

In [4]:
# Match any 'x' followed by one or more digits
text = "x, x2, x42, x999"
matches = re.findall(r'x\d+', text)
print(matches)  # Output: ['x2', 'x42', 'x999']

## Explanation:

# x matches the character 'x' literally.
# \d+ matches one or more digits (0-9).

['x2', 'x42', 'x999']


##### Groups and Capturing:
Parentheses () are used to create groups. You can capture and extract specific parts of the matched text using groups.

In [5]:
# Extract the area code and local number from a phone number
phone = "Phone: (123) 456-7890"
match = re.search(r'\((\d{3})\) (\d{3}-\d{4})', phone)
if match:
    area_code, local_number = match.groups()
    print("Area Code:", area_code)
    print("Local Number:", local_number)
## Explanation
# \( and \) match literal parentheses in the string. Since parentheses have a special meaning in regular expressions, we need to escape them with backslashes to treat them as ordinary characters.
# (\d{3}): This is a group that matches three digits in a row. \d represents any digit (0-9), and {3} specifies that it should be repeated exactly three times.
# (\d{3}-\d{4}): This is another group that matches a pattern of three digits followed by a hyphen and then four digits.

Area Code: 123
Local Number: 456-7890
