# My concise Regular Expression practices
### Some theory from Dan Jurafsky's NLP course on Coursera
**Goal of Regular Expressions (aka regex)**: It's the most basic tool for text processing.<br>
**Usage examples**: Dealing with different formats of a word (singular/plural, starting with upper/lower case letter, or any combination of these).<br>
**Disjunction** `[]`: The simplest tool in regex. Matches _any one character inside the square brackets_. <br>
**Ranges** `[0-9], [a-z], [A-Z]`: Used to avoid annoying and long expressions. <br>
* Example: All alphabetical characters: `[A-Za-z]`
* Example: Space and exclamation mark: `[ !]`

**Negation (caret sign)** `^`: Used when we don't want a set of characters (matches everything except those characters). <br> The caret means negation only when it comes first inside `[]`; anywhere else inside the brackets it is a literal caret.
<br>\* To search for a literal caret outside brackets, use `\^`.
* Example: Not a capital letter: `[^A-Z]` 
* Example: Neither ‘S’ nor ‘s’: `[^Ss]`
* Example: Neither ‘e’ nor ‘^’: `[^e^]`

**Pipe (`|`) for disjunction**: Performs an "or" between alternatives.
* Example: `a|b|c` = `[abc]`
* Example: `[gG]roundhog|[Ww]oodchuck`
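
The pipe examples can be tried out in Python. This is only a minimal sketch: the sample sentences are made up for illustration, and the `re` module used here is introduced later in these notes.

```python
import re

# A single-character alternation behaves like a character class.
text = "a cab backed up"
same = re.findall(r"a|b|c", text) == re.findall(r"[abc]", text)
print(same)  # True

# Alternation between two whole words, each with an optional capital letter.
pattern = r"[gG]roundhog|[Ww]oodchuck"
sample = "Groundhog is another name for the woodchuck."
animals = re.findall(pattern, sample)
print(animals)  # ['Groundhog', 'woodchuck']
```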

**Special characters**:
* `?`: Optional previous character
 * Example: `colou?r` matches both `colour` and `color`
* `*`: Zero or more of previous character
* `+`: One or more of previous character
 * `*` and `+` are called Kleene operators (after Stephen C. Kleene)
 * Example: `oo*h!` = `o+h!`
* `.` : Any character
 * Example: `beg.n` matches with `begin`, `began`, `begun`, `beg3n`, etc.
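
A quick way to check these special characters is to run them against a few made-up words. The sketch below assumes Python's `re` module (covered later in these notes); the test strings are illustrative only.

```python
import re

# `?` makes the preceding character optional.
colour_hits = re.findall(r"colou?r", "color colour colouur")
print(colour_hits)  # ['color', 'colour']

# `*` allows zero or more and `+` one or more of the preceding character,
# so `oo*h!` and `o+h!` find the same strings.
star_hits = re.findall(r"oo*h!", "oh! ooh! oooh! h!")
plus_hits = re.findall(r"o+h!", "oh! ooh! oooh! h!")
print(star_hits == plus_hits, star_hits)  # True ['oh!', 'ooh!', 'oooh!']

# `.` stands for exactly one character, so "begn" is not matched.
dot_hits = re.findall(r"beg.n", "begin began begun beg3n begn")
print(dot_hits)  # ['begin', 'began', 'begun', 'beg3n']
```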

**Anchors (special characters)**:
* `^`: Matches the beginning of the line
* `$`: Matches the end of the line
 * Example: `\.$` matches a period at the end of a line
 * A period by itself (`.`) means any character; a backslash followed by a period (`\.`) means a literal period.


* Example: Find all instances of the word “the” in a text.
 * Answer: `[^A-Za-z][tT]he[^A-Za-z]`
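
Running the answer on a sample sentence shows one limitation: because it needs a non-letter on both sides, it misses a "The" that starts the text. The sketch below compares it with a word-boundary version (`\b` is described near the end of these notes). The sentence is invented, and the `\b` variant is offered as one possible alternative, not as the course's answer.

```python
import re

text = "The other day the cat sat on the mat, then left."

# Pattern from the notes: requires a non-letter on BOTH sides of the word,
# and the surrounding characters become part of each match.
strict = re.findall(r"[^A-Za-z][tT]he[^A-Za-z]", text)
print(strict)   # [' the ', ' the ']

# Word boundaries consume no characters and also match at the string start.
bounded = re.findall(r"\b[tT]he\b", text)
print(bounded)  # ['The', 'the', 'the']
```

Neither pattern matches the "the" inside "other" or "then".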


### From SoloLearn
This part contains some new material, as well as the implementation of regular expressions in Python.

**What is regex?** Regular expressions are a powerful tool for text manipulation.<br>
They are a domain-specific language (DSL) that is available as a library in most modern programming languages, not just Python.<br>
They are useful for two main tasks:
* verifying that strings match a pattern (for instance, that a string has the format of an email address), 
* performing substitutions in a string (such as changing all American spellings to British ones).

Regular expressions in Python are accessed through the `re` module, which is part of the standard library.<br>
Once you have defined a regular expression, the `re.match` function determines whether it matches at the beginning of a string. If it does, `re.match` returns an object representing the match; if not, it returns `None`.<br>
To avoid confusion, regular expressions here are written as raw strings, `r"expression"`. In a raw string, Python does not treat a backslash as the start of an escape sequence, which makes regular expressions easier to write.

In [9]:
# Importing Python regex package
import re

In [1]:
# Example:
pattern = r"spam"

if re.match(pattern, "spamspamspam"):
   print("Match")
else:
   print("No match")

Match


Other functions to match patterns are `re.search`, `re.findall` and `re.finditer`. 
* The function `re.search` finds a match of a pattern anywhere in the string.
* The function `re.findall` returns a list of all substrings that match a pattern.
* The function `re.finditer` does the same thing as `re.findall`, except it returns an iterator, rather than a list.

In [2]:
# Example:
pattern = r"spam"

if re.match(pattern, "eggspamsausagespam"):
   print("Match")
else:
   print("No match")

if re.search(pattern, "eggspamsausagespam"):
   print("Match")
else:
   print("No match")
    
print(re.findall(pattern, "eggspamsausagespam"))

No match
Match
['spam', 'spam']
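
The cell above does not exercise `re.finditer`. As a minimal sketch on the same string, each item it yields is a match object, so positions are available as well as the matched text:

```python
import re

pattern = r"spam"

# Collect (text, start, end) for every match found by finditer.
spans = [(m.group(), m.start(), m.end())
         for m in re.finditer(pattern, "eggspamsausagespam")]
print(spans)  # [('spam', 3, 7), ('spam', 14, 18)]
```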


The `re.search` function returns a match object with several methods that give details about the match. <br>
These include `group`, which returns the matched string; `start` and `end`, which return the start and end positions of the match; and `span`, which returns both positions as a tuple.

In [3]:
# Example:
pattern = r"pam"

match = re.search(pattern, "eggspamsausage")
if match:
   print(match.group())
   print(match.start())
   print(match.end())
   print(match.span())

pam
4
7
(4, 7)


The `re.sub` function replaces occurrences of the pattern in `string` with `repl`. All occurrences are replaced unless `count` is given, in which case at most `count` replacements are made. It returns the modified string.<br>
Syntax:
```python
re.sub(pattern, repl, string, count=0)
```
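
As a small illustration of `re.sub` and its `count` argument, here is a sketch that follows the American-to-British spelling idea mentioned earlier. The sentence is made up.

```python
import re

text = "The color of the colorful coat is my favorite color."

# count defaults to 0, which means "replace every occurrence".
all_replaced = re.sub(r"color", "colour", text)
print(all_replaced)
# The colour of the colourful coat is my favorite colour.

# With count=1 only the first occurrence is replaced.
first_only = re.sub(r"color", "colour", text, count=1)
print(first_only)
# The colour of the colorful coat is my favorite color.
```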

**Metacharacters** are what make regular expressions more powerful than normal string methods. They allow you to create regular expressions to represent concepts like "one or more repetitions of a vowel". <br>
The existence of metacharacters poses a problem if you want to create a regex that matches a literal metacharacter, such as `$`. You can do this by escaping the metacharacters by putting a backslash (`\`) in front of them. However, this can cause problems, since backslashes also have an escaping function in normal Python strings. This can mean putting three or four backslashes in a row to do all the escaping. To avoid this, you can use a raw string, which is a normal string with an `r` in front of it.
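
The difference between the two ways of writing an escaped `$` can be seen in a short sketch. The price string is invented. `re.escape` is a standard-library helper that builds an escaped pattern from literal text; it is shown here as a convenience and is not part of the original material.

```python
import re

price = "Total: $5"

# In a normal string the backslash itself has to be escaped for Python.
plain = "\\$[0-9]"
# In a raw string the pattern is written exactly as the regex engine sees it.
raw = r"\$[0-9]"
print(plain == raw)  # True

found = re.search(raw, price).group()
print(found)  # $5

# re.escape adds the backslashes for you when the text should be literal.
escaped = re.escape("$5.00")
print(escaped)  # \$5\.00
```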

`.`: Matches any character other than a newline.

In [4]:
# Example:
pattern = r"gr.y"

if re.match(pattern, "grey"):
   print("Match 1")

if re.match(pattern, "gray"):
   print("Match 2")

if re.match(pattern, "blue"):
   print("Match 3")

Match 1
Match 2


`^` and `$`: These match the start and end of a string, respectively.

In [5]:
# Example:
pattern = r"^gr.y$"

if re.match(pattern, "grey"):
   print("Match 1")

if re.match(pattern, "gray"):
   print("Match 2")

if re.match(pattern, "stingray"):
   print("Match 3")


Match 1
Match 2


* The pattern `^gr.y$` matches a string that starts with `gr`, continues with any single character except a newline, and ends with `y`.
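
The theory section describes `^` and `$` as line anchors, while in Python they anchor to the whole string by default. The sketch below shows how the standard `re.MULTILINE` flag switches to per-line behaviour; the three-line test string is made up.

```python
import re

text = "grey\ngray\nstingray"

# Default: ^ and $ refer to the whole string, so nothing matches here.
whole = re.findall(r"^gr.y$", text)
print(whole)     # []

# re.MULTILINE: ^ and $ also match at the start and end of each line.
per_line = re.findall(r"^gr.y$", text, flags=re.MULTILINE)
print(per_line)  # ['grey', 'gray']
```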

**Character classes** provide a way to match only one of a specific set of characters.<br>
A character class is created by putting the characters it matches inside square brackets `[]`.

In [6]:
# Example:
pattern = r"[aeiou]"

if re.search(pattern, "grey"):
   print("Match 1")

if re.search(pattern, "qwertyuiop"):
   print("Match 2")

if re.search(pattern, "rhythm myths"):
   print("Match 3")

Match 1
Match 2


Character classes can also match ranges of characters. 
* Example: `[G-P]` matches any uppercase character from G to P.

Multiple ranges can be included in one class. For example, `[A-Za-z]` matches a letter of either case.

In [7]:
#Example:
pattern = r"[A-Z][A-Z][0-9]"

if re.search(pattern, "LS8"):
   print("Match 1")

if re.search(pattern, "E3"):
   print("Match 2")

if re.search(pattern, "1ab"):
   print("Match 3")

Match 1


`^`: Place it at the start of a character class to invert it, so that the class matches any character other than the ones listed. Other metacharacters, such as `$` and `.`, have no special meaning within character classes, and `^` itself has no special meaning there unless it is the first character in the class. <br>
* Example: The pattern `[^A-Z]` matches any single character that is not an uppercase letter.

Note that `^` must be inside the brackets to invert the character class.

In [8]:
# Example:
pattern = r"[^A-Z]"

if re.search(pattern, "this is all quiet"):
   print("Match 1")

if re.search(pattern, "AbCdEfG123"):
   print("Match 2")

if re.search(pattern, "THISISALLSHOUTING"):
   print("Match 3")

Match 1
Match 2


Some more metacharacters are `*`, `+`, `?` and `{}`, which specify numbers of repetitions.<br>
`*` means "zero or more repetitions of the previous thing". It tries to match as many repetitions as possible. The "previous thing" can be a single character, a class, or a group of characters in parentheses. <br>

In [10]:
# Example:
pattern = r"egg(spam)*"

if re.match(pattern, "egg"):
   print("Match 1")

if re.match(pattern, "eggspamspamegg"):
   print("Match 2")

if re.match(pattern, "spam"):
   print("Match 3")

Match 1
Match 2
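
To see what "as many repetitions as possible" means in practice, it helps to print the matched text instead of only checking for a match. A minimal sketch with the same pattern (the second test string is made up):

```python
import re

pattern = r"egg(spam)*"

# Greedy: every consecutive "spam" is consumed; matching stops at the second "egg".
greedy = re.match(pattern, "eggspamspamegg").group()
print(greedy)  # eggspamspam

# Zero repetitions are allowed, so only "egg" is matched here.
zero = re.match(pattern, "eggs and spam").group()
print(zero)    # egg
```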


`+` is very similar to `*`, except it means "**one** or more repetitions", as opposed to "**zero** or more repetitions".

In [11]:
# Example:
pattern = r"g+"

if re.match(pattern, "g"):
   print("Match 1")

if re.match(pattern, "gggggggggggggg"):
   print("Match 2")

if re.match(pattern, "abc"):
   print("Match 3")

Match 1
Match 2


`?` means "**zero or one** repetitions".

In [12]:
# Example:
pattern = r"ice(-)?cream"

if re.match(pattern, "ice-cream"):
   print("Match 1")

if re.match(pattern, "icecream"):
   print("Match 2")

if re.match(pattern, "sausages"):
   print("Match 3")

if re.match(pattern, "ice--ice"):
   print("Match 4")

Match 1
Match 2


Curly braces (`{}`) specify a number of repetitions that lies between two numbers.
The regex `{x,y}` means "between x and y repetitions of the previous thing".
Hence `{0,1}` is the same thing as `?`.<br>
If the first number is missing, it is taken to be zero. <br>
If the second number is missing, it is taken to be infinity. <br>

In [13]:
# Example:
pattern = r"9{1,3}$"

if re.match(pattern, "9"):
   print("Match 1")

if re.match(pattern, "999"):
   print("Match 2")

if re.match(pattern, "9999"):
   print("Match 3")

Match 1
Match 2
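
The cases with a missing bound, and the equivalence of `{0,1}` and `?`, can be checked with a short sketch. The helper `fits` is a hypothetical name introduced only for this example; it wraps the standard `re.fullmatch`, which requires the whole string to match.

```python
import re

samples = ["", "9", "99", "999"]

def fits(pattern, text):
    """Return True when the whole of `text` matches `pattern` (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# Second number missing: two or more repetitions.
at_least_two = [s for s in samples if fits(r"9{2,}", s)]
print(at_least_two)  # ['99', '999']

# First number missing: zero to two repetitions.
at_most_two = [s for s in samples if fits(r"9{,2}", s)]
print(at_most_two)   # ['', '9', '99']

# {0,1} behaves like ?.
optional_u = [w for w in ["color", "colour"] if fits(r"colou{0,1}r", w)]
print(optional_u)    # ['color', 'colour']
```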


**Groups** <br>
A group can be created by surrounding part of a regular expression with parentheses. This means that a group can be given as an argument to metacharacters such as `*` and `?`.

In [14]:
# Example:
pattern = r"egg(spam)*"

if re.match(pattern, "egg"):
   print("Match 1")

if re.match(pattern, "eggspamspamspamegg"):
   print("Match 2")

if re.match(pattern, "spam"):
   print("Match 3")

Match 1
Match 2


The content of groups in a match can be accessed using the `group` method. <br>
A call of `group(0)` or `group()` returns the whole match. <br>
A call of `group(n)`, where n is greater than 0, returns the nth group from the left. <br>
The method `groups()` returns a tuple of all groups from 1 upward. <br>
Groups can be nested.

In [15]:
# Example:
pattern = r"a(bc)(de)(f(g)h)i"

match = re.match(pattern, "abcdefghijklmnop")
if match:
   print(match.group())
   print(match.group(0))
   print(match.group(1))
   print(match.group(2))
   print(match.groups())

abcdefghi
abcdefghi
bc
de
('bc', 'de', 'fgh', 'g')


There are several kinds of special groups. Two useful ones are named groups and non-capturing groups. <br>
Named groups have the format `(?P<name>...)`, where _name_ is the name of the group and _..._ is the content. They behave exactly like normal groups, except that they can be accessed by `group(name)` in addition to their number. <br>
Non-capturing groups have the format `(?:...)`. They are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering.

In [16]:
# Example:
pattern = r"(?P<first>abc)(?:def)(ghi)"

match = re.match(pattern, "abcdefghi")
if match:
   print(match.group("first"))
   print(match.groups())

abc
('abc', 'ghi')


`|` is another metacharacter, and means "or".
* Example: `red|blue` matches either `red` or `blue`.

In [17]:
# Example:
pattern = r"gr(a|e)y"

match = re.match(pattern, "gray")
if match:
   print("Match 1")

match = re.match(pattern, "grey")
if match:
   print("Match 2")

match = re.match(pattern, "griy")
if match:
   print("Match 3")

Match 1
Match 2


**Special Sequences** <br>
They are written as a backslash followed by another character. One useful special sequence is a backslash followed by a number between 1 and 99, e.g., `\1` or `\17`. It matches the same text that the group with that number matched.

In [18]:
# Example:
pattern = r"(.+) \1"

match = re.match(pattern, "word word")
if match:
   print("Match 1")

match = re.match(pattern, "?! ?!")
if match:
   print("Match 2")

match = re.match(pattern, "abc cde")
if match:
   print("Match 3")

Match 1
Match 2


* Note that `(.+) \1` is not the same as `(.+) (.+)`: `\1` refers to the text that the first group actually matched, not to the group's pattern, so the second half must repeat the first half exactly.

More useful special sequences are `\d`, `\s`, and `\w`. These match digits, whitespace, and word characters respectively. <br>
In ASCII mode they are equivalent to `[0-9]`, `[ \t\n\r\f\v]`, and `[a-zA-Z0-9_]`.
In Unicode mode they match certain other characters, as well. For instance, `\w` matches letters with accents.
Versions of these special sequences with upper case letters - `\D`, `\S`, and `\W` - mean the opposite to the lower-case versions. For instance, `\D` matches anything that isn't a digit.


In [19]:
# Example:
pattern = r"(\D+\d)"

match = re.match(pattern, "Hi 999!")

if match:
   print("Match 1")

match = re.match(pattern, "1, 23, 456!")
if match:
   print("Match 2")

match = re.match(pattern, " ! $?")
if match:
    print("Match 3")

Match 1


* `(\D+\d)` matches one or more non-digits followed by a digit.
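
The difference between ASCII and Unicode mode can be seen with the standard `re.ASCII` flag. This is a small sketch: the sample text is made up, and the accented letters are written as escapes so that the string is the same regardless of file encoding.

```python
import re

text = "na\u00efve caf\u00e9 42"   # "naïve café 42"

# Unicode mode is the default for str patterns: accented letters count as \w.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)  # ['naïve', 'café', '42']

# In ASCII mode \w is limited to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(ascii_words)    # ['na', 've', 'caf', '42']

# \D is the opposite of \d: runs of anything that is not a digit.
non_digits = re.findall(r"\D+", text)
print(non_digits)     # ['naïve café ']
```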

Additional special sequences are `\A`, `\Z`, and `\b`. <br>
The sequences `\A` and `\Z` match the beginning and end of a string, respectively. <br>
The sequence `\b` matches the empty string between `\w` and `\W` characters, or `\w` characters and the beginning or end of the string. Informally, it represents the boundary between words.<br>
The sequence `\B` matches the empty string anywhere else.

In [20]:
# Example:
pattern = r"\b(cat)\b"

match = re.search(pattern, "The cat sat!")
if match:
   print("Match 1")

match = re.search(pattern, "We s>cat<tered?")
if match:
   print("Match 2")

match = re.search(pattern, "We scattered.")
if match:
   print("Match 3")

Match 1
Match 2


`\b(cat)\b` basically matches the word "cat" surrounded by word boundaries.
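
The example above covers `\b` only. The following sketch contrasts it with `\B` and shows how `\A` differs from `^` once `re.MULTILINE` is set; the test strings are made up for illustration.

```python
import re

text = "cat scattered concat"

# \b...\b: "cat" as a whole word only (start index 0).
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
print(whole_word)  # [0]

# \B...\B: "cat" strictly inside a longer word (the one in "scattered").
inside = [m.start() for m in re.finditer(r"\Bcat\B", text)]
print(inside)      # [5]

# \A matches only at the very start of the string, even with re.MULTILINE,
# whereas ^ then matches at the start of every line.
lines = "cat\ncat"
string_start = re.findall(r"\Acat", lines, flags=re.MULTILINE)
line_starts = re.findall(r"^cat", lines, flags=re.MULTILINE)
print(string_start, line_starts)  # ['cat'] ['cat', 'cat']
```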