# Regular Expression

In [1]:
import re

# ashishad14@gmail.com

pattern = r"[a-z]+[A-Z0-9]+@[a-zA-Z]+\.(com|in|net)"

user_input = input()

if re.search(pattern, user_input):
    print("Valid Email")
else:
    print("Invalid email")

Invalid email


In [2]:
import re 
# Example input:   999-888-1010
# Expected output: 9998881010
pattern = r"(\d\d\d)-(\d\d\d)-(\d\d\d\d)"

new_pattern = r"\1\2\3"

user_input = input()

new_user_input = re.sub(pattern,new_pattern,user_input)

print(new_user_input)


9995558888


## 1. What exactly is a Regular Expression?

A regular expression, often called a pattern, is **an expression used to specify a set of strings** required for a particular purpose. 

- A simple way to specify a finite set of strings is to list its elements or members. <br>For example `{file, file1, file2}`. 
    

- However, there are often more concise ways to specify the desired set of strings. <br>For example, the set `{file, file1, file2}` can be specified by the pattern `file(1|2)?`. <br>We say that this pattern matches each of the three strings. [Wanna check?](https://regexr.com/48om5)

> In most formalisms, if there exists at least one regular expression that matches a particular set then there exists an infinite number of other regular expressions that also match it, i.e. **the specification is not unique**.<br>
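For example, the string set `{file, file1, file2}` can also be specified by the pattern `file[12]?`. (A looser pattern such as `file\d?` matches these three strings too, but it also accepts `file3`, `file4`, and so on.)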
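The non-uniqueness claim above can be checked with a small sketch. The candidate list and variable names are made up for illustration; `re.fullmatch` is used so that a pattern has to cover the whole string.

```python
import re

target = {"file", "file1", "file2"}
# Hypothetical candidates: the three target strings plus some near misses.
candidates = ["file", "file1", "file2", "file3", "file12", "files"]

# Two different patterns that describe exactly the same set of strings.
results = {}
for p in [r"file(1|2)?", r"file[12]?"]:
    results[p] = {s for s in candidates if re.fullmatch(p, s)}
    print(p, sorted(results[p]))

# A broader pattern: it accepts the target set, but "file3" as well.
broader = {s for s in candidates if re.fullmatch(r"file\d?", s)}
print(sorted(broader))
```

Both of the first two patterns accept the same three strings, while `file\d?` describes a larger set.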


## 2. The math of Regular Expressions

- The concept of **Regular Expressions** originated from **[Regular Languages](https://en.wikipedia.org/wiki/Regular_language)**. 

- **Regular Expressions** describe **Regular Languages** in **[Formal Language Theory](https://en.wikipedia.org/wiki/Formal_language)**.

> ***Formal Language Theory***: In mathematics, computer science, and linguistics, a **formal language** consists of words whose letters are taken from an alphabet and are **well-formed according to a specific set of rules**. The field of formal language theory studies primarily the purely syntactical aspects of such languages—that is, their internal structural patterns.

> ***Regular Languages***: A regular language is a category of **formal languages** which can be expressed using a regular expression. 
![](formal-lang-theory.png)

- **Note**: Today, many regular expression engines provided by modern programming languages are augmented with features that allow recognition of languages that <span style="color:red;">**cannot**</span> be expressed by a classic regular expression!
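As one illustration of the note above, here is a minimal sketch using a backreference in Python's `re` module. The language "some `a`s, one `b`, then the same number of `a`s again" is a textbook example of a non-regular language; the pattern and the test strings are chosen only for this illustration.

```python
import re

# The backreference \1 must repeat exactly what group 1 captured,
# so both sides of the "b" need the same number of "a" characters.
same_on_both_sides = re.compile(r"(a+)b\1")

balanced = same_on_both_sides.fullmatch("aabaa") is not None
unbalanced = same_on_both_sides.fullmatch("aabaaa") is not None

print(balanced)    # True
print(unbalanced)  # False
```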

## 3. Uses of Regular Expressions

Some important usages of regular expressions are:

- Check if an input honors a given pattern; for example, we can check whether a value entered in an HTML form is a valid e-mail address


- Check whether a pattern appears in a piece of text; for example, check if either the word "color" or the word "colour" appears in a document with just **one scan**


- Extract specific portions of a text; for example, extract the postal code of an address


- Replace portions of text; for example, change any occurrence of "color" or "colour" to "red"


- Split a larger text into smaller pieces, for example, splitting a text by any appearance of the dot, comma, or newline characters
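The five uses listed above can be sketched with Python's `re` module as follows. All sample strings are invented, and the e-mail pattern is a deliberately simplified assumption, not a complete validator.

```python
import re

# 1. Validate: does the whole input follow the pattern?
email_like = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")  # simplified on purpose
is_valid = email_like.fullmatch("someone@example.com") is not None

# 2. Search: does "color" or "colour" appear anywhere? One scan is enough.
has_colour = re.search(r"colou?r", "Pick a colour for the wall") is not None

# 3. Extract: pull a five-digit postal code out of an address.
m = re.search(r"\b\d{5}\b", "350 Fifth Avenue, New York, NY 10118")
postal_code = m.group() if m else None

# 4. Replace: swap every "color"/"colour" for "red".
repainted = re.sub(r"colou?r", "red", "color and colour")

# 5. Split: break the text at any dot, comma, or newline.
pieces = re.split(r"[.,\n]", "one,two.three\nfour")

print(is_valid, has_colour, postal_code, repainted, pieces)
```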

## 4. A brief history of Regular Expressions

> *The story begins with a neuroscientist and a logician who together tried to understand how the human brain could produce complex patterns using simple cells that are bound together.*

- In 1943, neurophysiologist ***Warren McCulloch*** and logician ***Walter Pitts*** published ***"A logical calculus of the ideas immanent in nervous activity"***. This paper not only marked the beginning of regular expressions, but also proposed the first mathematical model of a neural network.


- In 1956, ***Stephen Kleene*** wrote the paper ***"Representation of events in nerve nets and finite automata"***, where he coined the terms **regular sets** and **regular expressions** and presented a simple algebra.


- In 1968, the Unix pioneer ***Ken Thompson*** took Kleene's work and extended it, publishing his studies in the paper ***"Regular Expression Search Algorithm"***. Thompson did not stop at the paper: he also implemented Kleene's notation in the editor ***QED***, so that users could do advanced pattern matching in text files. The same feature appeared later on in the editor ***ed***.

> To search for a regular expression in ed, you wrote `g/<regular expression>/p`. The letter g meant global search and p meant print the result. This command, `g/re/p`, gave its name to the standalone program grep, released in the fourth edition of Unix in 1973.<br><span style="color:red;">However, **grep** didn't have a complete implementation of regular expressions.</span>

- In 1979, ***Alfred Aho*** developed ***egrep (extended grep)*** in the seventh edition of Unix. The program egrep translated any regular expression into a corresponding [DFA](https://en.wikipedia.org/wiki/Deterministic_finite_automaton).


- In 1987, ***Larry Wall*** created the scripting language ***Perl***. Regular expressions are seamlessly integrated into Perl, which even gives them their own literal syntax. Hence, Perl pushed regular expressions into the mainstream. The Perl implementation went further and added many modifications to the original regular expression syntax, creating the so-called ***Perl flavor***.

### Some other milestones worth mentioning

- The IEEE, through its POSIX standard, has tried to standardize regular expression syntax and behavior and to give it better Unicode support. This is called the ***POSIX flavor*** of regular expressions.


- In the late 1980s, ***Henry Spencer*** wrote ***"regex"***, a widely used software library for regular expressions in the C programming language.


### Here is a brief timeline to summarize...

![history](images/history.png)

### Regex today

- It was the rise of the web that gave a big boost to the Perl implementation of regex, and that is where the modern syntax of regular expressions comes from. `Apache`, `C`, `C++`, `the .NET languages`, `Java`, `JavaScript`, `MySQL`, `PHP`, `Python` and `Ruby` all endeavor to be Perl-compatible. There is also a library called `PCRE`, which stands for Perl-Compatible Regular Expressions.


- Today, the standard Python module for regular expressions—`re`—supports only Perl-style regular expressions. There is an [effort](https://pypi.python.org/pypi/regex) to write a new regex module with better POSIX style support. This new module is intended to replace Python's `re` module implementation eventually. 

## 5. Understanding the Regular Expression Syntax

A regex pattern is a simple sequence of characters. The components of a regex pattern are:

- **literals (ordinary characters)**: these characters carry no special meaning and are processed as is.

- **metacharacters (special characters)**: these characters carry a special meaning and are processed in some special way.


![](images/components.png)

Let's start with a simple example.

Consider that we have got the list of several filenames in a folder.

```
file1.xml
file1.txt
file2.txt
file15.xml
file5.docx
file60.txt
file5.txt
```

And we want to keep only those filenames which follow a specific pattern, i.e. `file<one or more digits>.txt`.

> Let's try to do this on an online tool to learn, build, & test Regular Expressions (RegEx / RegExp), [RegExr](https://regexr.com).

So, the regular expression we need here is:

`file\d+\.txt`

This expression can be understood as follows:

- `file` is a substring of literals, which are matched against the input as is.

- `\d` is a metacharacter which instructs the software to match this position with a digit (0-9).

- `+` is also a metacharacter; it instructs the software to match one or more repetitions of the preceding element (`\d` in this case).

- `\.` is a literal. `.` is a metacharacter but we want to use it as a literal in this case. Hence, we escape it using the `\` character.

- `txt` is a substring of literals, which are matched against the input as is.

![](images/example1.png)
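The same filter can be tried in Python. This is a minimal sketch using the filenames listed above; `fullmatch` is an assumption about the intent, chosen so that the whole filename has to follow the pattern.

```python
import re

filenames = [
    "file1.xml",
    "file1.txt",
    "file2.txt",
    "file15.xml",
    "file5.docx",
    "file60.txt",
    "file5.txt",
]

# file, then one or more digits, then a literal dot, then txt
pattern = re.compile(r"file\d+\.txt")

matching = [name for name in filenames if pattern.fullmatch(name)]
print(matching)  # ['file1.txt', 'file2.txt', 'file60.txt', 'file5.txt']
```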

In [3]:
import re

In [4]:
pattern = re.compile("Python")

In [5]:
pattern

re.compile(r'Python', re.UNICODE)

In [6]:
pattern = re.compile("Python", flags=re.I )

In [7]:
pattern

re.compile(r'Python', re.IGNORECASE|re.UNICODE)

In [8]:
match = pattern.match("python")

In [9]:
match.span()

(0, 6)

In [10]:
match.start()

0

In [11]:
match.end()

6

In [12]:
pattern.match("say python", pos=4) is None

False

In [13]:
pattern.match("python", endpos=5) is None

True

In [14]:
pattern.search("say python")

<re.Match object; span=(4, 10), match='python'>

In [15]:
pattern = re.compile("python", flags = re.I)

In [16]:
pattern.search("say python")

<re.Match object; span=(4, 10), match='python'>

In [17]:
pattern.findall("say python is not python as in computer language")

['python', 'python']

In [18]:
matches = pattern.finditer("say python is not python as in computer language")

In [19]:
for match in matches:
    print(match.span())

(4, 10)
(18, 24)


In [20]:
txt = "This book costs $15."

In [21]:
pattern = re.compile("$15")

In [22]:
pattern.search(txt)

In [23]:
pattern = re.compile(r"\$15")

In [24]:
pattern.search(txt)

<re.Match object; span=(16, 19), match='$15'>

In regular expressions, there are twelve metacharacters that should be escaped if they are to be used with their literal meaning:

- Backslash `\`
- Caret `^`
- Dollar sign `$`
- Dot `.`
- Pipe symbol `|`
- Question mark `?`
- Asterisk `*`
- Plus sign `+`
- Opening parenthesis `(`
- Closing parenthesis `)`
- Opening square bracket `[`
- Opening curly brace `{`
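A short sketch of why escaping matters, using the dot as the example metacharacter. The sample text is invented for illustration.

```python
import re

text = "pi is roughly 3.14, not 3x14"

# Unescaped: the dot is a metacharacter and matches any character.
unescaped = re.findall(r"3.14", text)

# Escaped: the dot now only matches a literal dot.
escaped = re.findall(r"3\.14", text)

print(unescaped)  # ['3.14', '3x14']
print(escaped)    # ['3.14']
```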

# Character Classes

- **Character classes** (also known as **character sets**) let us define a single position in the pattern that matches any one character from a given set.


- To define a character class, we should use the opening square bracket metacharacter `[`, then any accepted characters, and finally close with a closing square bracket `]`.

### Example 1

Consider the example below, where the spellings `license` and `licence` have been mixed up and we want to find all occurrences of either one in the text.

In [26]:
txt = """
Yesterday, I was driving my car without a driving licence. The traffic police stopped me and asked me for my 
license. I told them that I forgot my licence at home. 
"""

In [27]:
pattern = re.compile("licen[cs]e")

In [28]:
pattern.findall(txt)

['licence', 'license', 'licence']

In [29]:
txt = """
The first season of Indian Premiere League (IPL) was played in 2008. 
The second season was played in 2009 in South Africa. 
Last season was played in 2018 and won by Chennai Super Kings (CSK).
CSK won the title in 2010 and 2011 as well.
Mumbai Indians (MI) has also won the title 3 times in 2013, 2015 and 2017.
"""

In [30]:
pattern = re.compile("[1-9][0-9][0-9][0-9]")

In [31]:
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

> There is another possibility—the negation of ranges. We can invert the meaning
of a character set by placing a caret (`^`) symbol right after the opening square
bracket metacharacter (`[`).

For example, to find all the characters used in a text except lowercase vowels, we can use the pattern:

In [32]:
pattern = re.compile("[^aeiou]")

In [33]:
pattern.findall(txt)

['\n', 'T', 'h', ' ', 'f', 'r', 's', 't', ' ', 's', 's', 'n', ' ', 'f', ' ',
 'I', 'n', 'd', 'n', ' ', 'P', 'r', 'm', 'r', ' ', 'L', 'g', ' ', '(', 'I',
 'P', 'L', ')', ' ', 'w', 's', ' ', 'p', 'l', 'y', 'd', ' ', 'n', ' ', '2',
 '0', '0', '8', '.', ' ', '\n',
 ...]


# Predefined Character Classes

There exist some predefined character classes which can be used as a shortcut for some frequently used classes.


<table style="border: 1px solid black; font-size:15px;">
<thead>
<tr>
    <th>Element</th>
    <th>Description</th>
</tr>
</thead>
    
<tbody>
<tr>
    <td>.</td>
    <td>This element matches any character except newline</td>
</tr>

<tr>
    <td>\d</td>
    <td>This matches any decimal digit; for ASCII text this is equivalent to the class [0-9]</td>
</tr>

<tr>
    <td>\D</td>
    <td>This matches any non-digit character; this is equivalent to the class [^0-9]</td>
</tr>

<tr>
    <td>\s</td>
    <td>This matches any whitespace character; this is equivalent to the class
[ \t\n\r\f\v]</td>
</tr>

<tr>
    <td>\S</td>
    <td>This matches any non-whitespace character; this is equivalent to the class
[^ \t\n\r\f\v]</td>
</tr>

<tr>
    <td>\w</td>
    <td>This matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class
[a-zA-Z0-9_]</td>
</tr>
    
<tr>
    <td>\W</td>
    <td>This matches any character that is neither alphanumeric nor an underscore; for ASCII text this is equivalent to the
class [^a-zA-Z0-9_]</td>
</tr>
</tbody>
</table>
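A quick sketch of a few of these shortcuts on a made-up sample string (the string and variable names are only for illustration):

```python
import re

sample = "Room 42, floor_3!"

digits = re.findall(r"\d", sample)      # decimal digits
words = re.findall(r"\w+", sample)      # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)      # whitespace characters
non_word = re.findall(r"\W", sample)    # everything that \w does not match

print(digits)    # ['4', '2', '3']
print(words)     # ['Room', '42', 'floor_3']
print(spaces)    # [' ', ' ']
print(non_word)  # [' ', ',', ' ', '!']
```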


Now we can shorten our pattern for finding years in a given text:

In [34]:
pattern = re.compile(r"[1-9]\d\d\d")

In [35]:
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

In [36]:
re.findall(r"[^\w\s]", txt)

['(', ')', '.', '.', '(', ')', '.', '.', '(', ')', ',', '.']

# The Backslash Plague

Let's start with an example.

Consider a text containing some Windows-style directory paths, in which we have to find the substring `C:\Windows\System32`.

In [37]:
txt = """
C:\Windows
C:\Python
C:\Windows\System32
"""

In [38]:
pattern = re.compile("C:\Windows\System32")

In [39]:
pattern.search(txt)

In [40]:
pattern = re.compile("C:\\Windows\\System32")

In [41]:
pattern.search(txt)

In [42]:
print("C:\\Windows\\System32")

C:\Windows\System32


### Still no match found. Why???

`\` is used as an escape at two different levels. 

- First, the Python interpreter itself performs substitutions for `\` before the `re` module ever sees the pattern string. For instance, `\n` is converted to a newline character, `\t` is converted to a tab character, etc. 

- Finally, `re` reads the substituted pattern string and will apply its own substitutions for `\` character. 

Hence, to match `\` as a **literal**, the regex engine must receive `\\`, and each of those two backslashes must in turn be written as `\\` in a normal Python string literal. That is how we end up typing `\\\\`.
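The two levels can be made visible with a small sketch. The shortened path and the variable names are assumptions made for illustration.

```python
import re

pattern_source = "C:\\\\Windows"   # what we type in a normal string literal

# Level 1: Python has already collapsed each typed pair of backslashes,
# so the string object holds two backslashes here, not four.
print(pattern_source)        # C:\\Windows
print(len(pattern_source))   # 11

# Level 2: the regex engine reads the remaining \\ as one literal backslash.
path = "C:\\Windows"         # the text itself holds a single backslash
print(path)                  # C:\Windows

match = re.search(pattern_source, path)
print(match.group() == path) # True
```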

In [43]:
pattern = re.compile("C:\\\\Windows\\\\System32")

In [44]:
pattern.search(txt)

<re.Match object; span=(22, 41), match='C:\\Windows\\System32'>

### Can we use 2 backslashes instead of 4 here?

Yes. By using **raw strings**, we can skip the first (Python-level) round of escaping.

> Python raw strings are represented as ***r"your string"***. In raw strings, no escaping is required as escape sequences like `\n`, `\t`, etc are not processed.

In [45]:
pattern = re.compile(r"C:\\Windows\\System32")

In [46]:
pattern.search(txt)

<re.Match object; span=(22, 41), match='C:\\Windows\\System32'>

### Do we really need to use 2 backslashes?

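If the text you are looking for should be matched literally (that is, you are **not using any metacharacters** for their special meaning), you can use the `re.escape()` function to do the escaping for you. It escapes every character that has a special meaning in a regular expression (before Python 3.7 it escaped everything except ASCII letters, numbers and '_').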

In [47]:
re.escape(r"C:\Windows\System32")

'C:\\\\Windows\\\\System32'

In [48]:
re.search(re.escape(r"C:\Windows\System32"), txt)

<re.Match object; span=(22, 41), match='C:\\Windows\\System32'>

# Alternation

Just like character classes are used to match a single character out of several possible characters, **alternation** is used to match a single regular expression out of several possible regular expressions.

This is accomplished using the pipe symbol `|`.

Consider a scenario where you want to find all occurrences of `and`, `or`, `the` in a given text.

> One way is to write and execute 3 separate regular expressions. Using alternation, it can be done in a single regular expression!

In [49]:
txt = """
the most common conjunctions are and, or and but.
"""

In [50]:
pattern = re.compile("and|or|the")

In [51]:
pattern.findall(txt)

['the', 'and', 'or', 'and']

In [52]:
txt = """
What is your name?
Who is that guy?
"""

##### First method

In [53]:
pattern = re.compile("What|Who is")

##### Second Method

In [54]:
pattern = re.compile("(What|Who) is")
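The two methods above do not match the same text. A minimal sketch of the difference, using the same two sentences: without parentheses, the alternatives are `What` and `Who is`; with parentheses, the choice is limited to `What` or `Who`, and ` is` must follow either one. `finditer` with `group()` is used here so that the full matched text is shown.

```python
import re

txt = "What is your name?\nWho is that guy?\n"

# First method: alternation applies to everything on each side of the pipe.
ungrouped = [m.group() for m in re.finditer("What|Who is", txt)]

# Second method: parentheses limit the alternation to "What" or "Who".
grouped = [m.group() for m in re.finditer("(What|Who) is", txt)]

print(ungrouped)  # ['What', 'Who is']
print(grouped)    # ['What is', 'Who is']
```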

# Quantifiers

**Quantifiers** are the mechanisms to define how a **character**, **metacharacter**, or **character set** can be **repeated**.

Here is the list of the 4 basic quantifiers:

<table style="border: 1px solid black; font-size:15px;">
<thead>
<tr>
    <th>Symbol</th>
    <th>Name</th>
    <th>Quantification of previous character</th>
</tr>
</thead>
    
<tbody>
<tr>
    <td>?</td>
    <td>Question Mark</td>
    <td>Optional (0 or 1 repetitions)</td>
</tr>
    
<tr>
    <td>*</td>
    <td>Asterisk</td>
    <td>Zero or more times</td>
</tr>

<tr>
    <td>+</td>
    <td>Plus Sign</td>
    <td>One or more times</td>
</tr>

<tr>
    <td>{n,m}</td>
    <td>Curly Braces</td>
    <td>Between n and m times</td>
</tr>
</tbody>
</table>
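As a quick side-by-side sketch, here are all four quantifiers applied to the letter `a` in one made-up string:

```python
import re

txt = "ct cat caat caaat caaaat"

optional = re.findall(r"ca?t", txt)       # zero or one "a"
zero_or_more = re.findall(r"ca*t", txt)   # any number of "a", including none
one_or_more = re.findall(r"ca+t", txt)    # at least one "a"
bounded = re.findall(r"ca{2,3}t", txt)    # two or three "a"

print(optional)      # ['ct', 'cat']
print(zero_or_more)  # ['ct', 'cat', 'caat', 'caaat', 'caaaat']
print(one_or_more)   # ['cat', 'caat', 'caaat', 'caaaat']
print(bounded)       # ['caat', 'caaat']
```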


Let us go through different examples to understand them one by one.

### Example 1

Find all the matches for `dog` and `dogs` in the given text.

In [55]:
txt = """
I have 2 dogs. One dog is 1 year old and other one is 2 years old. Both dogs are very cute! 
"""

In [56]:
pattern = re.compile("dogs?")

In [57]:
pattern.findall(txt)

['dogs', 'dog', 'dogs']

### Example 2

Find all filenames starting with `file` and ending with `.txt` in the given text.

In [58]:
txt = """
file1.txt
file_one.txt
file.txt
fil.txt
file.xml
file-1.txt
"""

In [59]:
pattern = re.compile(r"file[\w-]*\.txt")

In [60]:
pattern.findall(txt)

['file1.txt', 'file_one.txt', 'file.txt', 'file-1.txt']