# Regular Expression

A regular expression (also called a “regex” or “regexp”) provides a concise and flexible means of matching strings of text, such as particular characters, words, or patterns of characters.

A regular expression is written in a formal language that can be interpreted by a regular expression processor.

> Really clever “wild card” expressions for matching and parsing strings. Smart “Find” or “Search”.

### Understanding Regular Expressions! ###

- Very powerful and quite cryptic
- Fun once you understand them
- Regular expressions are a language unto themselves
- A language of “marker characters” - programming with characters
- This tool will save your day!




![xkcd](https://imgs.xkcd.com/comics/regular_expressions.png)

### Regular Expressions Quick Guide ###

```
^        Matches the beginning of a line
$        Matches the end of the line
.        Matches any character
\s       Matches whitespace
\S       Matches any non-whitespace character
*        Repeats a character zero or more times
*?       Repeats a character zero or more times (non-greedy)
+        Repeats a character one or more times
+?       Repeats a character one or more times (non-greedy)
[aeiou]  Matches a single character in the listed set
[^XYZ]   Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
(        Indicates where string extraction is to start
)        Indicates where string extraction is to end
```

### Examples: ###

`.at` matches any three-character string ending with "at", including "hat", "cat", and "bat".

`[hc]at` matches "hat" and "cat".

`[^b]at` matches all strings matched by .at except "bat".

`[^hc]at` matches all strings matched by .at other than "hat" and "cat".

`^[hc]at` matches "hat" and "cat", but only at the beginning of the string or line.

`[hc]at$` matches "hat" and "cat", but only at the end of the string or line.

`\[.\]` matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".

`s.*` matches s followed by zero or more characters, for example: "s" and "saw" and "seed".
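
The examples above can be checked quickly in Python. This is a minimal sketch: the word list and the helper name `full_matches` are made up for illustration, and `re.fullmatch` is used so that each pattern has to cover the whole word.

```python
import re

words = ["hat", "cat", "bat", "rat"]

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

dot_at = full_matches(r".at", words)       # any character, then "at"
hc_at = full_matches(r"[hc]at", words)     # "h" or "c", then "at"
not_b_at = full_matches(r"[^b]at", words)  # any character except "b", then "at"
print(dot_at)     # ['hat', 'cat', 'bat', 'rat']
print(hc_at)      # ['hat', 'cat']
print(not_b_at)   # ['hat', 'cat', 'rat']

# Escaped brackets are literal: "[", any one character, "]"
bracketed = re.findall(r"\[.\]", "[a] [b] [cd]")
print(bracketed)  # ['[a]', '[b]']
```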


# Python: Regular Expressions module ###

* Before you can use regular expressions in your program, you must import the library using “<font color=green>**import re**</font>”
* You can use **<font color=green>re.search()</font>** to see if a string matches a regular expression, similar to using the **<font color=red>find()</font>** method for strings
* You can use **<font color=green>re.findall()</font>** to extract portions of a string that match your regular expression, similar to a combination of **<font color=red>find()</font>** and **<font color=red>slicing</font>**:  var[5:10]

In [2]:
import re

### Matching and Extracting Data! ###

When we use **re.findall()**, it returns a _list_ of zero or more sub-strings that match the regular expression


In [3]:
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+',x)   # [0-9] is the listed set
print(y)


['2', '19', '42']


In [3]:
x = 'MY 2 favOrite nUmbers are 19 and 42'
y = re.findall('[AEIOUY]',x)
print(y)


['Y', 'O', 'U']


### Fine-Tuning Your Match! ###

In [4]:
x = 'X-DSPAM-Result: Innocent'
y = re.findall(r'^X-\S+:',x)   # raw string, so \S reaches the regex engine untouched
print(y)

['X-DSPAM-Result:']


Let's analyze the regex `^X-\S+:`

```
+--------- ^   Match the start of the line
|++------- X-  Match the literal characters X-
|||
^X-\S+:
   |||+--- :   Match a literal colon
   ||+---- +   Repeat the previous item one or more times
   ++----- \S  Match any non-whitespace character

Now, what's different with the following?
+        Repeats a character one or more times
+?       Repeats a character one or more times (non-greedy)
*        Repeats a character zero or more times
*?       Repeats a character zero or more times (non-greedy)
```

In [5]:
x = 'X-Plane is behind schedule: two weeks'
y = re.findall('^X.*:',x)
print(y)

['X-Plane is behind schedule:']


In [6]:
x = 'X-Plane is behind schedule: two weeks'
y = re.findall('^X-.*',x)
print(y)

['X-Plane is behind schedule: two weeks']


**Wild-Card Characters**

* The **dot** character matches any character
* If you add the **asterisk** character after it, the dot is repeated “any number of times” (zero or more)


### Warning: Greedy Matching ###

The repeat characters (* and +) push outward in both directions (**greedy**) to match the largest possible string


In [7]:
x = 'From: Using the : character'
# note there are two : above

y = re.findall('^F.+:', x)
print(y)


['From: Using the :']


What happened? The **.+:** means “one or more of any character, followed by a colon”. Because **+** is greedy, the match runs on to the *last* **:** in the string.

### Non-Greedy Matching ###

Not all regular expression repeat codes are greedy! If you add a **?** character, the + and * chill out a bit... they are allowed to get lazy.


In [8]:
x = 'From: Using the : character'
# note there are two : above

y = re.findall('^F.+?:', x)
# note the use of ? 
print(y)


['From:']


What happened **NOW**? The **.+?:** means “one or more of any character, but as few as possible, followed by a colon”, so the match stops the first time it reaches a **:**.

This is also called lazy matching: a **?** placed after `+` or `*` lets the repetition stop as early as possible.

### Fine-Tuning String Extraction ###

You can refine the match for **re.findall()** and separately determine which portion of the match is to be extracted by **using parentheses**.

In [9]:
x = 'From edu.dsv@kaggle.com Sat Jan  5 09:14:16 2021'
y = re.findall(r'\S+@\S+',x)

print(y)

['edu.dsv@kaggle.com']


```
+++-------- \S At least one non-whitespace character
|||
\S+@\S+
    |||
    +++---- \S At least one non-whitespace character



```

**Parentheses** are not part of the match, but they mark where the string to extract **starts** and **stops**

In [10]:
x = 'From edu.dsv@kaggle.com Sat Jan  5 09:14:16 2021'
y = re.findall(r'\S+@\S+',x)

print(y)


['edu.dsv@kaggle.com']


In [5]:
x = 'From edu.dsv@kaggle.com Sat Jan  5 09:14:16 2021'
y = re.findall(r'^From (\S+@\S+)',x)
# parentheses ->       |       |

print(y)


['edu.dsv@kaggle.com']


## String Parsing Examples… ##

Extracting a host name - using **find** and **string slicing**


In [12]:
data = 'From edu.dsv@kaggle.com Sat Jan  5 09:14:16 2021'
#                   |          |
#      ->           12         23 

atpos = data.find('@') # find the first occurrence of @ and return its position
print(atpos)

12


In [13]:
# find the first occurrence of a space, starting from where we found the @, and return its position
sppos = data.find(' ',atpos)
# note the 2nd param    |

print(sppos)


23


In [14]:
# slicing strings with the :
host = data[atpos+1 : sppos]
print(host)

kaggle.com



---

### Quick note on find() and search()

As seen above, **data.find('@')** returns an int: the position of the first occurrence of @ if found, otherwise -1.

`re.search()` returns a match object if the pattern is found and `None` otherwise, so it works directly as a true/false test. To use it in the same example, you'd write:

    if re.search('@', data) :

Similarly, we fine-tune what is matched by adding special characters to the string. 

For instance, to use it as you'd use **startswith()** with strings, you would:

    if re.search('^From:', data) :

> Useful when reading a file and you need to know whether a line contains something specific (**True or False**) but <font color=red>not</font> its **exact position**, at least initially
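
A small side-by-side sketch of the two calls on the sample line used in this notebook. The `'#'` lookups are an assumption added only to show what each call gives back when nothing is found, and `'^From '` (with a space) is used because that is how this sample line starts.

```python
import re

data = 'From edu.dsv@kaggle.com Sat Jan  5 09:14:16 2021'

# str.find() returns an index, or -1 when the substring is absent
found_at = data.find('@')
missing = data.find('#')
print(found_at, missing)          # 12 -1

# re.search() returns a match object (truthy) or None (falsy)
match = re.search(r'@', data)
no_match = re.search(r'#', data)
print(match.start(), no_match)    # 12 None

# The regex counterpart of data.startswith('From ')
starts_with_from = bool(re.search(r'^From ', data))
print(starts_with_from)           # True
```

Note that `-1` is truthy in Python, so `if data.find('@'):` is not a safe test, while `if re.search('@', data):` is.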

---

### The Double Split Pattern ###

Sometimes we split a line one way, and then grab one of the pieces of the line and split that piece again


In [15]:
line = 'From edu.dsv@kaggle.com Sat Jan  5 09:14:16 2021'
#   ->      |                  |   |   |  |        |    -> a list of 7 elements when split on whitespace.

words = line.split()        # splits on whitespaces
print(words[5])


09:14:16


In [16]:
email = words[1]            # 'edu.dsv@kaggle.com'
pieces = email.split('@')   # ['edu.dsv', 'kaggle.com']
print(pieces[1])


kaggle.com


## The Regex Version ##

In [17]:
import re 

line = 'From edu.dsv@kaggle.com Sat Jan  5 09:14:16 2021'
y = re.findall('@([^ ]*)',line)
print(y)


['kaggle.com']


```
      +-----+--   Extract the non-blank characters
      |     |
    '@([^ ]*)'
     | |  ||
     | |  |+---   Match many of them
     | |  |
     | +--+----   Match non-blank character
     |
     +---------   Look through the string until you find an '@' sign



```

## Even cooler Regex Version ##

In [4]:
import re 
line = 'From edu.dsv@kaggle.com Sat Jan  5 09:14:16 2021'
y = re.findall('^From .*@([^ ]*)',line)
print(y)

['kaggle.com']


Now, let's take a closer look at the regex:
`^From .*@([^ ]*)`

```
          +--+--- Match non-blank character
          |  |+-- Match many of them
          |++||
^From .*@([^ ]*)
|     ||||     +-- Stop extracting
|     |||+-------- Start extracting
|     |||
|     ||+--  ...looking for an @ sign
|     ++--- now, skip a bunch of characters
|
|+++++-----  ... look for the string 'From ' 
+----------  Starting at the beginning of the line



```

## Escape Character ##

If you want a special regular expression character to just behave **normally** (most of the time) you prefix it with '\\'

This allows you to **match** reserved characters and not use their special meanings in regex.


In [19]:
import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+',x)   # raw string keeps the backslash for the regex engine
print(y)


['$10.00']


```
  \$[0-9.]+
  | |     |
  | |     +---- One or more of them
  | ++++++----- A digit or period
  +------------ A real $ sign
```

> Note:
>
> &#92; escapes the next character, allowing you to match reserved characters `[ ] ( ) { } . * + ? ^ $ \`

---

# Let's go up one step in complexity

We've been playing around with email address patterns in our examples so far, but we've also been naive as they don't always work correctly in the real world.

By now you would be able to understand most of the next regex. Most, but we'll need to explain some of it.

`\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b`

> In short - With the above pattern, we can search through a text file to find or verify with **this pattern** if a given string looks like an **email address**.
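
Here is a rough sketch of that pattern at work. Two assumptions are added here: the sample text is invented, and `re.IGNORECASE` is passed because the character classes as written list only uppercase letters. Keep in mind that `{2,4}` also rejects top-level domains longer than four letters, so this is a "looks like an e-mail" check rather than a strict validator.

```python
import re

# The pattern from the text, compiled case-insensitively (an assumption,
# since [A-Z] alone would reject lowercase addresses)
EMAIL = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b', re.IGNORECASE)

text = 'Contact edu.dsv@kaggle.com or admin@example.org, not user@localhost.'
emails = EMAIL.findall(text)
print(emails)   # ['edu.dsv@kaggle.com', 'admin@example.org']
```

`user@localhost` is skipped because the pattern insists on a dot followed by two to four letters after the host name.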

It looks like a complex pattern because it combines lots of things:

- Character classes
- Alphabet sets
- Percentage signs
- Numbers
- Underscores
- Braces `{}`, word boundaries, and so on
    
The most basic regex pattern is a single literal character, such as `b`. In the string **"Zebra is an animal."**, it matches the first b, the one in **Zebra**. For now, it doesn't matter that this b sits in the middle of a word.

Next come a few basic building blocks of a regex; with them, we can break the e-mail address pattern above down piece by piece.

There are **11** metacharacters (characters with special meaning): the opening square bracket `[`, the backslash `\`, the caret `^`, the dollar sign `$`, the period or dot `.`, the vertical bar or pipe symbol `|`, the question mark `?`, the asterisk or star `*`, the plus sign `+`, the opening round bracket `(` and the closing round bracket `)`.

|Meta character|Description|
|:----:|----|
|**.**|Period. Matches any single character except a line break.|
|**[ ]**|Character class. Matches any character contained between the square brackets.|
|**[^ ]**|Negated character class. Matches any character that is not contained between the square brackets.|
|**\***|Matches 0 or more repetitions of the preceding symbol.|
|**+**|Matches 1 or more repetitions of the preceding symbol.|
|**?**|Makes the preceding symbol optional.|
|**{n,m}**|Braces. Matches at least "n" but not more than "m" repetitions of the preceding symbol.|
|**(xyz)**|Character group. Matches the characters xyz in that exact order.|
|**&#124;**|Alternation. Matches either the characters before or the characters after the symbol.|
|**&#92;**|Escapes the next character. This allows you to match reserved characters `[ ] ( ) { } . * + ? ^ $ \`.|
|**^**|Matches the beginning of the input.|
|**$**|Matches the end of the input.|

- Escape example: If you want to use any of these characters as a literal in a regex, you need to escape it with a backslash. If you want to match **<1+1=2>**, the correct regex is `1\+1=2`. Otherwise, the plus sign keeps its special meaning. **Note** that `1+1=2`, with the *backslash omitted*, is a **valid** regex, so you will **not** get an error message. But it will not match **<1+1=2>**; it matches **<111=2>** instead.
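
A quick sketch of the difference, using a made-up sentence. `re.escape()` from the standard library is shown as a convenience for escaping literal text automatically.

```python
import re

text = 'Everyone knows 1+1=2, but 111=2 is wrong.'

escaped = re.findall(r'1\+1=2', text)    # "\+" is a literal plus sign
unescaped = re.findall(r'1+1=2', text)   # "+" means one or more "1"
print(escaped)     # ['1+1=2']
print(unescaped)   # ['111=2']

# re.escape() adds the backslashes for you
auto_escaped = re.findall(re.escape('1+1=2'), text)
print(auto_escaped)   # ['1+1=2']
```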

# The Regex engine always returns the left-most match
This is a very important point to understand: even if a **better** match could be found later, the engine will always return the leftmost match, meaning the first occurrence when scanning from the left.

When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character.

Only if all possibilities have been tried and failed, then the engine will continue with the second character in
the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that
the regex engine will return the leftmost match.
- When applying **<cat\>** to **He captured a catfish for his cat.**, the engine will try to match the first token in the regex, **<c\>**, to the first character in the string, **H**. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the **<c\>** with the **e**. This fails too, as does matching the **<c\>** with the space. Arriving at the 4th character in the string, **<c\>** <font color=red>matches</font> **c**. The engine will then try to match the second token, **<a\>**, to the 5th character, **a**. This <font color=red>succeeds</font> too. But then, **<t\>** <font color=red>fails</font> to match **p**. At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it will continue with the 5th: **a**. Again, **<c\>** fails to match here and the engine carries on. At the 15th character in the string, **<c\>** again <font color=red>**matches**</font> **c**. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that **<a\>** <font color=red>**matches**</font> **a** and **<t\>** <font color=red>**matches**</font> **t**.

- The entire regular expression could be matched starting at character 15. The engine is **"eager"** to report a
match. **It will therefore report the first three letters of catfish as a valid match**. The engine **never** proceeds
beyond this point to see if there are any **better** matches. The *first match* is considered good enough. 
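
The walk-through above can be reproduced with `re.search()`. Python counts positions from 0, so the 15th character is index 14. The `\bcat\b` variant is an addition for contrast, showing one way to ask for the standalone word instead.

```python
import re

text = 'He captured a catfish for his cat.'

first = re.search(r'cat', text)
print(first.start(), first.group())             # 14 cat
print(text[first.start():first.start() + 7])    # catfish

# Word boundaries (not part of the example above) skip "catfish"
whole_word = re.search(r'\bcat\b', text)
print(whole_word.start())                       # 30
```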

# Regex's Fundamentals

### Character Sets/Classes
Character sets are also called character classes. **Square brackets** are used to specify character sets. Use a **hyphen -** inside a character set to specify a range of characters. The order of the items inside the square brackets doesn't matter (each range itself still runs from low to high, as in `a-z`). For example, the regular expression `[Tt]he` means: `an uppercase T or lowercase t, followed by the letter h, followed by the letter e.`

- **<[Tt]he>** => <font color=red>The</font> car parked in <font color=red>the</font> garage.

A period inside a character set, however, means a literal period. The regular expression **<ar[.]>** means: a lowercase character a, followed by letter r, followed by a period **.** character.

- **<ar[.]>** => A garage is a good place to park a c<font color=red>ar.</font>

- **<[0-9]>** => Matches a **single digit between 0 and 9**. You can use more than one range.
- **<[0-9a-fA-F]>** => Matches a **single hexadecimal digit**, case insensitively. 
- You can combine ranges and single characters. **<[0-9a-fxA-FX]>** matches a hexadecimal digit or the letter X. *Again, the order of the characters and the ranges does not matter.*
- Find a word, even if it is misspelled, such as **<sep[ae]r[ae]te>** or **<li[cs]en[cs]e>**. 
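
A short sketch that runs the character-set examples above through `re.findall()`. The misspelled test words are invented for illustration.

```python
import re

the_matches = re.findall(r'[Tt]he', 'The car parked in the garage.')
print(the_matches)    # ['The', 'the']

# Inside a set, the dot is a literal period
dot_matches = re.findall(r'ar[.]', 'A garage is a good place to park a car.')
print(dot_matches)    # ['ar.']

hex_digits = re.findall(r'[0-9a-fA-F]', '0x1F')
print(hex_digits)     # ['0', '1', 'F']

spellings = re.findall(r'sep[ae]r[ae]te', 'seperate, separate, seperete')
print(spellings)      # ['seperate', 'separate', 'seperete']
```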

### Negated Character Sets/Classes

Typing a **caret (^)** after the opening square bracket will negate the character class. **The result is that the character class will match any character that is <font color = red>not</font> in the character class.**
- It is important to remember that a negated character class **still must match a character**. **<q[^u]>** does not mean: **<font color= red>a q not followed by a u</font>**. It means: **<font color= red>a q followed by a character that is not a u</font>**. It will **not** match the q in the string **Iraq**. It will match the q and the space after the q in **Iraq is a country**.
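
The Iraq example is easy to verify. The last three lines go slightly beyond the text: a negative lookahead, `(?!u)`, is one way (assumed here, not taken from the section above) to really say "a q not followed by a u".

```python
import re

no_match = re.search(r'q[^u]', 'Iraq')
print(no_match)                     # None: there is no character after the q

with_space = re.search(r'q[^u]', 'Iraq is a country')
print(repr(with_space.group()))     # 'q ' (the q and the following space)

# Lookahead checks what follows without consuming it
lookahead = re.search(r'q(?!u)', 'Iraq')
print(lookahead.group())            # q
```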

###  Shorthand Character Sets

Regular expressions provide **shorthands** for commonly used character sets. The
shorthand character sets are as follows:

|Shorthand|Description|
|:----:|----|
|`.`|Any character except a newline. It's the most commonly misused metacharacter.|
|`\w`|Matches word characters (letters, digits and underscore): `[a-zA-Z0-9_]`|
|`\W`|Matches non-word characters: `[^\w]`|
|`\d`|Matches a digit: `[0-9]`|
|`\D`|Matches a non-digit: `[^\d]`|
|`\s`|Matches a whitespace character: `[ \t\n\r\f\v]`|
|`\S`|Matches a non-whitespace character: `[^\s]`|

The sets shown are the ASCII equivalents. With Python 3 `str` patterns, `\w`, `\d` and `\s` also match the corresponding Unicode characters unless the `re.ASCII` flag is set.
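
A minimal sketch of the shorthands in action, on an invented sample string that mixes digits, an underscore, a tab and a newline.

```python
import re

text = 'Room 42, floor_3:\tok\n'

digits = re.findall(r'\d+', text)     # runs of digits
words = re.findall(r'\w+', text)      # runs of letters, digits, underscore
spaces = re.findall(r'\s', text)      # each whitespace character
only_digits = re.sub(r'\D', '', text) # drop every non-digit

print(digits)        # ['42', '3']
print(words)         # ['Room', '42', 'floor_3', 'ok']
print(spaces)        # [' ', ' ', '\t', '\n']
print(only_digits)   # 423
```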
    
## Repetitions

The following metacharacters `+`, `*` or `?` are used to specify how many times a
subpattern can occur. These meta characters act differently in different
situations.

### The Star *

The symbol `*` matches zero or more repetitions of the preceding matcher. The
regular expression `a*` means: zero or more repetitions of preceding lowercase
character `a`. But if it appears after a character set or class then it finds
the repetitions of the whole character set. 
For example, the regular expression
- `[a-z]*` means: any number of lowercase letters in a row.

The `*` symbol can be used with the meta character `.` to match any string of
characters `.*`. The `*` symbol can be used with the whitespace character `\s`
to match a string of whitespace characters. For example, the expression
`\s*cat\s*` means: zero or more spaces, followed by lowercase character `c`,
followed by lowercase character `a`, followed by lowercase character `t`,
followed by zero or more spaces.

### The Plus +

The symbol `+` matches one or more repetitions of the preceding character. For
example, the regular expression `c.+t` means: lowercase letter `c`, followed by
at least one character, followed by the lowercase character `t`. It needs to be
clarified that `t` is the last `t` in the sentence.

- **<c.+t>** => The fat <font color='red'> cat sat on the mat</font>.
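
The same sentence in Python, with the lazy form `c.+?t` next to it for comparison.

```python
import re

sentence = 'The fat cat sat on the mat.'

greedy = re.search(r'c.+t', sentence).group()
lazy = re.search(r'c.+?t', sentence).group()

print(greedy)   # cat sat on the mat
print(lazy)     # cat
```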

### The Question Mark ?
In a regular expression, the meta character `?` makes the preceding character optional: it matches zero or one instance of it. For example, the regular expression `[T]?he` means: `an optional uppercase letter T, followed by the lowercase character h, followed by the lowercase character e.`

- **<[T]?he>** => <font color=red>The</font> car parked in t<font color=red>he</font> garage.
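
A quick check of the optional `T`. The `colou?r` pattern is an extra illustration that is not part of the text above.

```python
import re

sentence = 'The car parked in the garage.'
optional_t = re.findall(r'[T]?he', sentence)
print(optional_t)   # ['The', 'he']

# Another common use: accept two spellings of one word
spellings = re.findall(r'colou?r', 'color and colour')
print(spellings)    # ['color', 'colour']
```

In the lowercase "the", only `he` is matched, because the optional character is an uppercase `T`.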

### The Lazy Star *? 

Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the
previous item, before trying permutations with ever increasing matches of the preceding
item. 

|Regex|Means|
|:----:|----|
|abc+|        matches a string that has ab followed by one or more c|
|abc?|       matches a string that has ab followed by zero or one c|
|abc{2}|      matches a string that has ab followed by 2 c|
|abc{2,}|    matches a string that has ab followed by 2 or more c|
|abc{2,5}|    matches a string that has ab followed by 2 up to 5 c|
|a(bc)\*|     matches a string that has a followed by zero or more copies of the sequence bc|
|a(bc){2,5}|  matches a string that has a followed by 2 up to 5 copies of the sequence bc|
|**<.+>**| matches the whole of `<div>simple div</div>`, because `+` is greedy|
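
A few rows of the table, sketched in Python. `re.fullmatch` is used for the counted repetitions so that the whole string has to fit the pattern, and the lazy `<.+?>` is added next to the greedy `<.+>` for contrast.

```python
import re

html = '<div>simple div</div>'
greedy_tags = re.findall(r'<.+>', html)
lazy_tags = re.findall(r'<.+?>', html)
print(greedy_tags)   # ['<div>simple div</div>']
print(lazy_tags)     # ['<div>', '</div>']

# Counted repetition: {2,5} applies to the preceding item only
three_c = re.fullmatch(r'abc{2,5}', 'abccc') is not None
one_c = re.fullmatch(r'abc{2,5}', 'abc') is not None
two_bc = re.fullmatch(r'a(bc){2,5}', 'abcbc') is not None
print(three_c, one_c, two_bc)   # True False True
```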

## Full stop or Period or dot **.**

In regular expressions, the dot or period is one of the most commonly <font color = 'Blue'>used</font> metacharacters. Unfortunately, it is also the most commonly <font color = 'Red'>misused</font> metacharacter. The dot is short for the negated character class **<[^\n]>** (UNIX regex flavors) or
**<[^\r\n]>** (Windows regex flavors).

<font color = 'Red'> <b> Use The Dot Sparingly </b> </font>
- The dot is a **very powerful** regex metacharacter. It allows you to be **lazy**. `Put in a dot, and everything will
match just fine when you test the regex on valid data. The problem is that the regex will also match in cases
where it should not match.`
    
Example: Let’s say we want to match a date in `mm/dd/yy` format, but we
want to leave the user the choice of date separators. The quick solution is **<\d\d.\d\d.\d\d>**. It seems fine at
first sight: it will match a date like `02/12/03`, just as we intended.
- <font color='red'><b>Trouble is: 02512703</b></font> is also considered a **valid date** by this regular expression. In this match, the first dot matched 5, and the second matched 7. Obviously *not* what we intended.
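
The problem is easy to reproduce. The `strict` pattern below is an assumed fix, not something stated in this section: it swaps each dot for a character class listing the separators we are willing to accept.

```python
import re

loose = r'\d\d.\d\d.\d\d'              # the quick solution from the text
strict = r'\d\d[- /.]\d\d[- /.]\d\d'   # assumption: only these four separators

results = {}
for candidate in ['02/12/03', '02-12-03', '02512703']:
    loose_ok = re.fullmatch(loose, candidate) is not None
    strict_ok = re.fullmatch(strict, candidate) is not None
    results[candidate] = (loose_ok, strict_ok)
    print(candidate, loose_ok, strict_ok)
# 02/12/03 True True
# 02-12-03 True True
# 02512703 True False
```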
    
## Dollars and Carets: Start of String and End of String Anchors ( $ and ^)

Anchors are a different breed. They do not match any character at all. Instead, they match a position before,
after or between characters. They can be used to `anchor` the regex match at a certain position. 
- The caret **<^>**
matches the position before the first character in the string. Applying **<^a>** to `abc` matches `a`. **<^b>** will
not match `abc` at all, because the **<b\>** cannot be matched right after the start of the string, matched by **<^>**.
- Similarly, **<\$>** matches right after the last character in the string. **<c\$>** matches `c` in `abc`, while **<a\$>** `does not` match `abc` at all....
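
The `abc` examples in Python, followed by a sketch of the `re.MULTILINE` flag, which makes `^` and `$` match at line breaks as well. The two-line sample text is invented.

```python
import re

# ^ anchors at the start of the string, $ at the end
starts_a = re.search(r'^a', 'abc') is not None
starts_b = re.search(r'^b', 'abc') is not None
ends_c = re.search(r'c$', 'abc') is not None
ends_a = re.search(r'a$', 'abc') is not None
print(starts_a, starts_b, ends_c, ends_a)   # True False True False

# With re.MULTILINE, ^ also matches right after each newline
text = 'one two\nthree four'
first_words_default = re.findall(r'^\w+', text)
first_words_multiline = re.findall(r'^\w+', text, re.MULTILINE)
print(first_words_default)     # ['one']
print(first_words_multiline)   # ['one', 'three']
```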

# Examples

In the following: Regex are written `highlighted` and the String to be matched is in "**bold**"

Suppose you want to use a regex to match a list of function names in a programming language: "**Get, GetValue, Set or SetValue.**" 
- The obvious solution is `Get|GetValue|Set|SetValue`

*Now look closely at both the regex and the string. Alternatives are tried from left to right, so on "**GetValue**" the regex above stops as soon as `Get` has matched.
Here are some other ways to write it that match the full names:*
-  `Get(Value)?|Set(Value)?`
-  `\b(Get|GetValue|Set|SetValue)\b`
-  `\b(Get(Value)?|Set(Value)?)\b`
- Even this one is correct: `\b(Get|Set)(Value)?\b`
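
Here is a sketch comparing these variants. The groups are written as non-capturing `(?:...)`, a small change made for this example only, because `re.findall()` returns the group contents rather than the whole match when a pattern contains capturing groups.

```python
import re

names = 'Get GetValue Set SetValue'

plain = re.findall(r'Get|GetValue|Set|SetValue', names)
optional = re.findall(r'Get(?:Value)?|Set(?:Value)?', names)
bounded = re.findall(r'\b(?:Get|GetValue|Set|SetValue)\b', names)
compact = re.findall(r'\b(?:Get|Set)(?:Value)?\b', names)

print(plain)      # ['Get', 'Get', 'Set', 'Set']
print(optional)   # ['Get', 'GetValue', 'Set', 'SetValue']
print(bounded)    # ['Get', 'GetValue', 'Set', 'SetValue']
print(compact)    # ['Get', 'GetValue', 'Set', 'SetValue']
```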

**Regex**:	`<[^>]+>`
- **What it does**:	This finds any HTML tag, such as `<a>`, `<b>`, `<img />`, `<br />`, etc. You can use this to find segments that have HTML tags you need to deal with, or to remove all HTML tags from a text.
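
A minimal sketch of both uses, on an invented HTML snippet: listing the tags with `re.findall()` and stripping them with `re.sub()`.

```python
import re

html = '<p>Hello <b>world</b><br /></p>'

tags = re.findall(r'<[^>]+>', html)
text_only = re.sub(r'<[^>]+>', '', html)

print(tags)        # ['<p>', '<b>', '</b>', '<br />', '</p>']
print(text_only)   # Hello world
```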
    
**Regex**:	`https?:\/\/[\w\.\/\-?=&%,]+`
- **What it does**:	This will find a URL. It will capture most URLs that begin with http:// or https://.
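
A quick try-out of the URL pattern. The first link is the one from the sample tweet used later in this notebook; the second, with a query string, is made up for the test.

```python
import re

url_pattern = r'https?:\/\/[\w\.\/\-?=&%,]+'
text = 'Test https://goo.gl/h1MfQV #android and http://example.com/a?b=1&c=2 end'

urls = re.findall(url_pattern, text)
print(urls)   # ['https://goo.gl/h1MfQV', 'http://example.com/a?b=1&c=2']
```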

**Regex**:	`'\w+?'`
- **What it does**:	This finds single words that are surrounded by apostrophes.

**Regex**:	`([-A-Za-z0-9_]*?([-A-Za-z_][0-9]|[0-9][-A-Za-z_])[-A-Za-z0-9_]*)`
- **What it does**:	Alphanumeric part numbers and references like: 1111_A, AA1AAA or 1-1-1-A, 21A1 and 10UC10P-BACW, abcd-1234, 1234-pqtJK, sft-0021 or 21-1_AB and 55A or AK7_GY.
This can be very useful if you are translating documents that have a lot of alphanumeric codes or references in them, and you need to be able to find them easily.

**Regex**:	`\b(the|The)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|should|among|above|under|$)\b)`
- **What it does**:	This finds text that begins with the or The and ends with stop words such as is, are, was, can, shall, must, that, which, about, by, at, if, when, should, among, above or under, or the end of the segment.
This is particularly useful when you need to extract terminology. Suppose you have segments like these: `The Web based look up is our new feature. A project manager should not proofread... Our Product Name is...`
    - The Regex shown above would find anything between The and is, or should. With most texts, there is a good chance that anything this Regex finds is a good term that you can add to your Termbase.

**Regex**:	`\b(a|an|A|An)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)`
- **What it does**:	This works much like the Regex shown above, except that it finds text that begins with a or an, rather than the. This can also be very helpful when you need to extract terminology from a project.

**Regex**:  `\b(this|these|This|These)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)`
- **What it does**:	This works much like the Regex shown above, except that it finds text that begins with this or these. This can also be very helpful when you need to extract terminology from a project.

**Regex** :`(.*?)`
- **What it does**: Accepts anything ("blah-blah-blah..."): it lazily captures any run of characters, possibly none.



---

## Python [re module](https://docs.python.org/3/library/re.html)

- `re.sub(regex, replacement, subject)` performs a search-and-replace across subject, replacing all
matches of regex in subject with replacement. The result is returned by the sub() function. **The subject
string you pass is not modified**. The re.sub() function applies the same backslash logic to the replacement text as is applied to the regular expression. Therefore, you should use raw strings for the replacement text...

Let's go back to some Python code and put to the test some regex we just learned.

In [6]:
import re

tweet = '#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android +#apps +#beautiful \
         #cute #health #igers #iphoneonly #iphonesia #iphone \
             <3 ;D :( :-('

#Let's take care of emojis and the #(hash-tags)...

print(f'Original Tweet ---- \n {tweet}')

## Replacing #hashtag with only hashtag
tweet = re.sub(r'#(\S+)', r' \1 ', tweet)
# This gets a bit technical: we use a backreference (\1) and a shorthand character set to replace the match with the captured group.
# \S = [^\s] matches any character that isn't whitespace
print(f'\n Tweet after replacing hashtags ----\n  {tweet}')

## Love -- <3, :*
tweet = re.sub(r'(<3|:\*)', ' EMO_POS ', tweet)
print(f'\n Tweet after replacing Emojis for Love with EMP_POS ----\n  {tweet}')

# The parentheses are for grouping, so we search (remember the raw string (`r`))
# either for <3 or (|) for :\* (* is a meta character, so it is preceded by a backslash)

## Wink -- ;-), ;), ;-D, ;D, (;,  (-;
tweet = re.sub(r'(;-?\)|;-?D|\(-?;)', ' EMO_POS ', tweet)
print(f'\n Tweet after replacing Emojis for Wink with EMP_POS ----\n  {tweet}')

# The parentheses are for grouping as usual. We first focus on `;-), ;)`: we need a `;` first,
# then either a `-` or nothing, which is what `-?` expresses, followed by an escaped `)`.
# That gives the first alternative `;-?\)`, and similarly for the others...

## Sad -- :-(, : (, :(, ):, )-:
tweet = re.sub(r'(:\s?\(|:-\(|\)\s?:|\)-:)', ' EMO_NEG ', tweet)
print(f'\n Tweet after replacing Emojis for Sad with EMP_NEG ----\n  {tweet}')

Original Tweet ---- 
 #fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android +#apps +#beautiful          #cute #health #igers #iphoneonly #iphonesia #iphone              <3 ;D :( :-(

 Tweet after replacing hashtags ----
   fingerprint   Pregnancy  Test https://goo.gl/h1MfQV  android  + apps  + beautiful            cute   health   igers   iphoneonly   iphonesia   iphone               <3 ;D :( :-(

 Tweet after replacing Emojis for Love with EMP_POS ----
   fingerprint   Pregnancy  Test https://goo.gl/h1MfQV  android  + apps  + beautiful            cute   health   igers   iphoneonly   iphonesia   iphone                EMO_POS  ;D :( :-(

 Tweet after replacing Emojis for Wink with EMP_POS ----
   fingerprint   Pregnancy  Test https://goo.gl/h1MfQV  android  + apps  + beautiful            cute   health   igers   iphoneonly   iphonesia   iphone                EMO_POS   EMO_POS  :( :-(

 Tweet after replacing Emojis for Sad with EMP_NEG ----
   fingerprint   Pregnancy  Test https://go

In [7]:
## Look at the output carefully: there are unnecessary spaces in between...
## Replace multiple spaces with a single space
tweet = re.sub(r'\s+', ' ', tweet)
print(f'\n Tweet after replacing xtra spaces ----\n  {tweet}')
      
## Remove the punctuation (+, ;, ...)
tweet = re.sub(r'[^\w\s]','',tweet)
print(f'\n Tweet after replacing Punctuation + with PUNC ----\n  {tweet}')


 Tweet after replacing xtra spaces ----
   fingerprint Pregnancy Test https://goo.gl/h1MfQV android + apps + beautiful cute health igers iphoneonly iphonesia iphone EMO_POS EMO_POS EMO_NEG EMO_NEG 

 Tweet after replacing Punctuation + with PUNC ----
   fingerprint Pregnancy Test httpsgooglh1MfQV android  apps  beautiful cute health igers iphoneonly iphonesia iphone EMO_POS EMO_POS EMO_NEG EMO_NEG 


In [22]:
# Bags of positive/negative smileys (you can extend the above example to take care of these too). A good exercise...

positive_emojis = set([
":‑)",":)",":-]",":]",":-3",":3",":->",":>","8-)","8)",":-}",":}",":o)",":c)",":^)","=]","=)",":‑D",":D","8‑D","8D",
"x‑D","xD","X‑D","XD","=D","=3","B^D",":-))",";‑)",";)","*-)","*)",";‑]",";]",";^)",":‑,",";D",":‑P",":P","X‑P","XP",
"x‑p","xp",":‑p",":p",":‑Þ",":Þ",":‑þ",":þ",":‑b",":b","d:","=p",">:P", ":'‑)", ":')",  ":-*", ":*", ":×"
])
negative_emojis = set([
":‑(",":(",":‑c",":c",":‑<",":<",":‑[",":[",":-||",">:[",":{",":@",">:(","D‑':","D:<","D:","D8","D;","D=","DX",":‑/",
":/",":‑.",'>:\\', ">:/", ":\\", "=/" ,"=\\", ":L", "=L",":S",":‑|",":|","|‑O","<:‑|"
])

# del positive_emojis, negative_emojis
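
One possible way to approach that exercise, as a rough sketch. The two small sets below are hypothetical ASCII subsets chosen for illustration, and `build_pattern` is a helper name invented here. Each smiley goes through `re.escape()` so its punctuation is taken literally, and longer smileys are placed first so that `:-))` is tried before `:-)`.

```python
import re

# Hypothetical subsets, standing in for the full bags above
positive = {":)", ":-)", ":-))", ":D", ";)"}
negative = {":(", ":-(", ":/"}

def build_pattern(smileys):
    """Join escaped smileys into one alternation, longest first."""
    ordered = sorted(smileys, key=len, reverse=True)
    return re.compile('|'.join(re.escape(s) for s in ordered))

POS = build_pattern(positive)
NEG = build_pattern(negative)

def tag_smileys(text):
    """Replace known smileys with EMO_POS / EMO_NEG markers."""
    return NEG.sub('EMO_NEG', POS.sub('EMO_POS', text))

tagged = tag_smileys('great phone :D but the battery :( yay :-))')
print(tagged)   # great phone EMO_POS but the battery EMO_NEG yay EMO_POS
```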

In [20]:
## Valid Dates..
pattern = r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])' 

- Matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31, with a choice of four separators (space included :))

- The year is matched by `(19|20)\d\d`
- The month is matched by `(0[1-9]|1[012])` (the parentheses are necessary so that the alternation covers both options)
    - By using character classes, 
        - the first option matches a number between `01 and 09`, and 
        - the second matches `10, 11 or 12`
- The last part of the regex consists of three options. The first matches the numbers `01
through 09`, the second `10 through 29`, and the third matches `30 or 31`... 
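
A short sketch that exercises this pattern. The helper name `looks_like_date` and the test strings are made up. Notice the last check: the regex only constrains each field on its own, so an impossible date such as February 30 still passes.

```python
import re

pattern = r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'

def looks_like_date(text):
    """True if the whole string fits the yyyy-mm-dd pattern above."""
    return re.fullmatch(pattern, text) is not None

print(looks_like_date('1999-12-31'))   # True
print(looks_like_date('2021 01 05'))   # True  (space separator)
print(looks_like_date('2021-13-05'))   # False (no month 13)
print(looks_like_date('2100-01-01'))   # False (year out of range)
print(looks_like_date('2021-02-30'))   # True  (month lengths are not checked)

# The three groups capture the century, the month and the day
parts = re.fullmatch(pattern, '1999-12-31').groups()
print(parts)   # ('19', '12', '31')
```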

In [24]:
## Pattern to match any IP Addresses 
pattern = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'

The above pattern will also match `999.999.999.999`, which isn't a valid IP address at all.
How accurate the regex needs to be depends on the data at hand.
To restrict all `4` numbers in the IP address to `0..255`, you can use this
complex beast: 
- `\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b`
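
The two patterns side by side, as a sketch. To keep the long one readable, the repeated octet is stored once in a variable (`octet`, a name chosen here) and written as a non-capturing group so that `re.findall()` returns whole addresses. The sample log line is invented.

```python
import re

loose_ip = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'

# One number from 0 to 255, reused four times below
octet = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
strict_ip = rf'\b{octet}\.{octet}\.{octet}\.{octet}\b'

log = 'Server 10.0.0.7 replied; 999.999.999.999 did not'

loose_hits = re.findall(loose_ip, log)
strict_hits = re.findall(strict_ip, log)
print(loose_hits)    # ['10.0.0.7', '999.999.999.999']
print(strict_hits)   # ['10.0.0.7']

edge_ok = re.fullmatch(strict_ip, '255.255.255.255') is not None
edge_bad = re.fullmatch(strict_ip, '256.1.1.1') is not None
print(edge_ok, edge_bad)   # True False
```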

---

# References
This was a short intro; you can explore regular expressions further in the following resources:

- [Wikipedia](http://en.wikipedia.org/wiki/Regular_expression)
- [rexegg](http://www.rexegg.com/regex-quickstart.html#chars)
- [greenend](http://www.greenend.org.uk/rjk/tech/regexp.html)


# More references...  #


- http://www.rexegg.com/ 
- https://github.com/aloisdg/awesome-regex
- http://linuxreviews.org/beginner/tao_of_regular_expressions/tao_of_regular_expressions.en.print.pdf
- https://developers.google.com/edu/python/regular-expressions
- https://www.youtube.com/watch?v=EkluES9Rvak

PS: I wrote the above for the [Amazing Course](https://mlcourse.ai/), which is maintained by [@kashnitsky](https://www.kaggle.com/kashnitsky), [@artgor](https://www.kaggle.com/artgor), [@datamove](https://www.kaggle.com/metadist) and many, many other amazing people from ODS who put the course together.

You can find the source notebooks [here](https://github.com/Yorko/mlcourse.ai/tree/master/jupyter_english/tutorials), along with a lot of other very cool stuff.