# Regular Expressions

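Regex patterns use backslashes heavily (for example `\d`), so they are usually written as raw strings, e.g. `r'\d'`. A raw string stops Python from treating the backslash as the start of a string escape sequence.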

* `\d` is the regex that matches a single numeric digit character
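As a small illustration of why raw strings help, the sketch below compares a raw string with its escaped equivalent. The variable names are made up for this example.

```python
import re

# Without the r prefix, the backslash itself has to be escaped.
plain_pattern = '\\d'
raw_pattern = r'\d'

# Both spell the same two-character pattern: a backslash followed by 'd'.
same_pattern = plain_pattern == raw_pattern
pattern_length = len(raw_pattern)

# The pattern matches the first digit in the string.
first_digit = re.search(raw_pattern, 'abc7').group()
print(same_pattern, pattern_length, first_digit)
```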

> Procedure:

* import the re module first
* create a regex object: call the re.compile() function 
* create a match object: call the regex object's search() method 
* to get the matched string: call the match object's group() method
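The four steps above can be sketched as follows. The pattern and the sample text are illustrative choices, not part of the original notes.

```python
import re  # step 1: import the module

# Step 2: compile the pattern into a regex object.
digits_regex = re.compile(r'\d\d\d')

# Step 3: search() returns a match object, or None when nothing matches.
match = digits_regex.search('room 404 is on floor 4')

# Step 4: group() returns the text that was actually matched.
matched_text = match.group() if match is not None else None
print(matched_text)
```

Checking for `None` before calling group() avoids an `AttributeError` when the text contains no match.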


## Groups


Groups are created in regex strings with parentheses.
The first set of parentheses is group 1, the second is 2, and so on.
Calling group() or group(0) returns the full matching string, group(1) returns group 1's matching string, and so on.
* use a backslash to match literal parentheses in the regex string: `\(` and `\)`
* the `|` pipe matches one of several alternative patterns
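A minimal sketch of numbered groups, escaped parentheses, and the pipe. The phone number format and the sample strings are assumptions chosen for illustration.

```python
import re

# Two groups: the area code (group 1) and the local number (group 2).
# The escaped \( and \) match literal parentheses around the area code.
phone_regex = re.compile(r'\((\d\d\d)\) (\d\d\d-\d\d\d\d)')
match = phone_regex.search('call (415) 555-1234 today')

full = match.group()    # same as match.group(0)
area = match.group(1)
local = match.group(2)
print(full, area, local)

# The pipe matches one of the alternatives; search() returns
# whichever alternative occurs first in the text.
pet_regex = re.compile(r'cat|dog')
first_pet = pet_regex.search('the dog chased the cat').group()
print(first_pet)
```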

* `?`: the group matches zero or one times
* `*`: the group matches zero or more times
* `+`: the group matches one or more times
* curly braces match a specific number of times, e.g. `{3}` matches exactly three times
  * curly braces with two numbers match between a minimum and a maximum number of times, e.g. `{3,5}`
  * leaving out the first number means there is no minimum (`{,5}` is zero to five times); leaving out the second means there is no maximum (`{3,}` is three or more times)
  * "**Greedy matching**" matches the longest string possible, "**non-greedy matching**" (or "lazy matching") matches the shortest string possible. Matching is greedy by default; putting a question mark after the curly braces (or after `*`, `+`, or `?`) makes it a non-greedy/lazy match.
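The quantifiers above can be tried out with a short sketch like this one. The patterns and test strings are illustrative only.

```python
import re

# ? makes the group optional (zero or one time).
optional_regex = re.compile(r'Bat(wo)?man')
without_group = optional_regex.search('Batman').group()
with_group = optional_regex.search('Batwoman').group()

# * allows zero or more repetitions; + requires at least one.
star_match = re.search(r'Bat(wo)*man', 'Batwowowoman').group()
plus_match = re.search(r'Bat(wo)+man', 'Batman')  # no 'wo', so no match

# {3,5} is greedy by default; a trailing ? makes it lazy.
greedy = re.search(r'(Ha){3,5}', 'HaHaHaHaHa').group()
lazy = re.search(r'(Ha){3,5}?', 'HaHaHaHaHa').group()

print(without_group, with_group, star_match, plus_match)
print(greedy, lazy)
```

The greedy pattern consumes all five repetitions, while the lazy one stops as soon as the minimum of three is satisfied.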


## findall()

The regex method findall() is passed a string and returns all matches in it, not just the first match.
* if the regex has 0 or 1 group, it returns a list of strings
* if the regex has 2 or more groups, it returns a list of tuples of strings
* `\d` is a shorthand character class that matches digits
* `\w` matches "word characters" (letters, numbers, and the underscore)
* `\s` matches whitespace characters (space, tab, newline).
* the uppercase shorthand character classes `\D`, `\W`, and `\S` match characters that are not digits, not word characters, and not whitespace, respectively
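A brief sketch of how the number of groups changes what findall() returns, plus two of the shorthand character classes. The sample text and variable names are assumptions made for this example.

```python
import re

text = 'Call 415-555-1234 or 212-555-0000'

# No groups: a list of the full matching strings.
no_groups = re.findall(r'\d\d\d-\d\d\d-\d\d\d\d', text)

# One group: a list of strings holding only that group's text.
one_group = re.findall(r'(\d\d\d)-\d\d\d-\d\d\d\d', text)

# Two groups: a list of tuples, one string per group.
two_groups = re.findall(r'(\d\d\d)-(\d\d\d-\d\d\d\d)', text)

# \w+ collects runs of word characters; \D+ collects runs of non-digits.
words = re.findall(r'\w+', 'snake_case, 2 words')
non_digits = re.findall(r'\D+', 'a1b22c')

print(no_groups, one_group, two_groups, words, non_digits)
```

Note that with exactly one group, the returned strings contain the group's text rather than the whole match.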

In [1]:
import re
phone_RegEx = re.compile(r'\d\d\d-\d\d\d')
print("The regular expression object is:\n{}".format(phone_RegEx))

The regular expression object is:
re.compile('\\d\\d\\d-\\d\\d\\d')


In [2]:
phone_RegEx.search("this is the number 789-984 and the number 888-999") # it returns the 1st match

<re.Match object; span=(19, 26), match='789-984'>

In [3]:
phone_RegEx.findall("this is the number 789-984 and the number 888-999") # all matches

['789-984', '888-999']