# Regular expression
- **Regular expressions** (RegEx) are a powerful language for matching text based on a pre-defined pattern.
- Examples: finding mobile numbers, email IDs, or URLs in HTML pages, and extracting targeted content.
- The Python `re` module provides regular expression support.
  
    ```
    import re
    match = re.search(pat, str)
    ```
- The `re.search()` method takes a regular expression `pattern` and a `string` and searches for that pattern within the string.
- If the search is successful, `search()` returns a match object; otherwise it returns `None`.
- It can detect the presence or absence of a text by matching it with a particular pattern.
- A pattern can also be split into one or more sub-patterns (groups) with parentheses, and the text matched by each group can be retrieved separately.


In [6]:
import re
match = re.search(r'C....', 'GeeksforGeeks: A computer science #CS110 \
                    portal for geeks C1234')
print(match)
print(match.group(0))
  
print('Start Index:', match.start())
print('End Index:', match.end())

<re.Match object; span=(35, 40), match='CS110'>
CS110
Start Index: 35
End Index: 40


In [7]:
print(match.group(1))

IndexError: no such group
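
`match.group(1)` fails above because the pattern `C....` contains no parentheses, so only group 0 (the whole match) exists. As a minimal sketch, wrapping parts of a pattern in `( )` creates numbered groups. The pattern and variable names below are illustrative and not part of the original notebook.

```python
import re

text = 'GeeksforGeeks: A computer science #CS110 portal for geeks C1234'

# Two capture groups: the capital letters after '#' and the digits that follow.
match = re.search(r'#([A-Z]+)(\d+)', text)

if match:
    whole = match.group(0)    # the entire matched text
    letters = match.group(1)  # first parenthesised sub-pattern
    digits = match.group(2)   # second parenthesised sub-pattern
    print(whole, letters, digits)  # prints: #CS110 CS 110
```

Asking for a group number higher than the number of parentheses in the pattern raises `IndexError`, as in the cell above.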

In [None]:
print(match.group(0))

In [None]:
# To search for the pattern 'word:' followed by a 3 letter word
import re

str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
  print('found', match.group()) ## 'found word:cat'
else:
  print('did not find')


found word:cat


## Basic Patterns
- a, X, 9, < -- ordinary characters just match themselves exactly. 
- The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below).

- . (a period) -- matches any single character except newline '\n'
- \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. **Note**: although "word" is the mnemonic for this, it only matches a single word char, not a whole word.
- \W (upper case W) matches any non-word character.
- \b -- boundary between word and non-word
- \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
- \t, \n, \r -- tab, newline, return
- \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
- ^ = start, $ = end -- match the start or end of the string
- \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has a special meaning, such as '@', you can try putting a backslash in front of it, \@. If it's not a valid escape sequence, like \c, your Python program will halt with an error.
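
A few of the codes above are easier to see in action. The following is a minimal sketch with made-up sample strings, showing `\b`, an escaped dot versus a bare dot, and `\S`. It uses `re.findall()`, which returns all matches as a list.

```python
import re

text = 'cat concat cat.jpg v1.5'

# \b anchors the match at a word boundary, so the 'cat' inside 'concat' is skipped.
whole_words = re.findall(r'\bcat\b', text)
print(whole_words)          # prints: ['cat', 'cat']

# An escaped dot matches only a literal '.', while a bare dot matches any character.
literal_dot = re.search(r'\d\.\d', text)
no_literal_dot = re.search(r'\d\.\d', 'v1x5')
any_char = re.search(r'\d.\d', 'v1x5')
print(literal_dot.group(), no_literal_dot, any_char.group())  # prints: 1.5 None 1x5

# \S+ grabs a run of non-whitespace characters.
first_token = re.search(r'\S+', text)
print(first_token.group())  # prints: cat
```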

In [8]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig') # found, match.group() == "iii"
print(match.group())


iii


In [9]:
match = re.search(r'igs', 'piiig') # not found, match is None
print(match.group())               # raises AttributeError because match is None


AttributeError: 'NoneType' object has no attribute 'group'

In [10]:
## . = any char but \n
match = re.search(r'..g', 'piiig') # found, match.group() == "iig"
print(match.group())


iig


In [14]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g') # found, match.group() == "123"
print(match.group())

123


In [16]:
match = re.search(r'\w\w\w\w', '@@ab_cd!!') # found, match.group() == "ab_c"
print(match.group())

ab_c


## Repetition
Things get more interesting when you use `+`, `*` and `?` to specify repetition in the pattern:

  - `+` -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
  - `*` -- 0 or more occurrences of the pattern to its left
  - `?` -- match 0 or 1 occurrences of the pattern to its left 
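
The three quantifiers differ mainly in whether they accept zero occurrences. A minimal sketch follows; the sample words are chosen only for illustration.

```python
import re

# 'u?' makes the 'u' optional, so both spellings match; 'colr' lacks the 'o' and fails.
words = ['color', 'colour', 'colr']
optional_u = {w: bool(re.search(r'colou?r', w)) for w in words}
print(optional_u)  # prints: {'color': True, 'colour': True, 'colr': False}

# '*' accepts zero occurrences, so 'pi*g' matches 'pg'.
zero_ok = re.search(r'pi*g', 'pg')
# '+' needs at least one 'i', so 'pi+g' does not match 'pg'.
one_needed = re.search(r'pi+g', 'pg')
print(zero_ok.group(), one_needed)  # prints: pg None
```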

### Leftmost & Largest
First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. `+` and `*` go as far as possible (the `+` and `*` are said to be `greedy`).
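
Greediness can produce a longer match than expected. The sketch below shows `.*` running to the last `>` in the string. It then uses the non-greedy form `.*?`, which is not covered in the text above but is part of Python's `re` syntax, to stop at the first `>`.

```python
import re

text = '<b>foo</b> and <i>bar</i>'

# '.*' is greedy: it extends to the last '>' in the string, not the first.
greedy = re.search(r'<.*>', text)
print(greedy.group())   # prints: <b>foo</b> and <i>bar</i>

# Adding '?' after the quantifier makes it non-greedy: it stops as early as possible.
lazy = re.search(r'<.*?>', text)
print(lazy.group())     # prints: <b>
```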

### Repetition Examples

In [None]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
print(match.group())


piii


In [None]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
print(match.group())


ii


In [None]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
print(match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
print(match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
print(match.group())

1 2   3
12  3
123


In [21]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match is None
print(match.group())                  # raises AttributeError because match is None

AttributeError: 'NoneType' object has no attribute 'group'

In [22]:
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
print(match.group())


bar


### Email example

In [None]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
  print(match.group())  ## 'b@google'

b@google


In [30]:
# Match addresses such as abc@vitstudent.ac.in or abc@vit.ac.in
# The dots in the domain are escaped so that they match only a literal '.'
str = 'purple alice-b@google.com monkey dishwasher abc@vit.ac.in dannfaoofew o'
match = re.search(r'[\w.]+@vit(student)?\.ac\.in', str)
if match:
  print(match.group())  ## 'abc@vit.ac.in'

abc@vit.ac.in


#### Square brackets
- Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'.
- The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot.
- For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

In [None]:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
  print(match.group())  ## 'alice-b@google.com'

alice-b@google.com
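
The points about square brackets can be checked directly. This is a minimal sketch with made-up sample strings: it shows that a dot inside brackets is literal, that a dash between two characters forms a range, and that a dash placed last in the set is literal.

```python
import re

# Inside brackets a dot is literal, so '[.]' finds nothing in 'abc'.
no_dot = re.search(r'[.]', 'abc')
dot = re.search(r'[.]', 'a.c')
print(no_dot, dot.group())  # prints: None .

# A dash between two characters gives a range: digits and lowercase a-f.
hex_run = re.search(r'[0-9a-f]+', 'ID=7f3a!')
print(hex_run.group())      # prints: 7f3a

# A dash placed last in the set is a literal '-'.
handle = re.search(r'[\w.-]+', 'alice-b.smith rest')
print(handle.group())       # prints: alice-b.smith
```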


## Beginning and end of the string

In [None]:
# Beginning of String
import re
match = re.search(r'^Geek', 'Campus Geek of the month')
print('Beg. of String:', match)

match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)

# End of String
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks')
print('End of String:', match)

Beg. of String: None
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>
End of String: <re.Match object; span=(31, 36), match='Geeks'>


In [None]:
import re
print('Any Character=>', re.search(r'p.th.n', 'pUthon 3'))

Any Character=> <re.Match object; span=(0, 6), match='pUthon'>


In [None]:
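import re
# \w+\d+ consumes the 'is18is18' prefix (word characters ending in at least one digit),
# then -[\d]{2}-[\d]{4} matches '-08-2020'. The {1} is redundant.
print('Date{mm-dd-yyyy}:', re.search(r'(\w+\d+){1}-[\d]{2}-[\d]{4}','today date \
is18is18-08-2020'))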

Date{mm-dd-yyyy}: <re.Match object; span=(11, 27), match='is18is18-08-2020'>


In [None]:
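import re
# Each repetition of (\w\d+) is one word character followed by one or more digits,
# so {2} matches '20' twice. {4} would need at least eight characters and return None.
print('Date{mm-dd-yyyy}:', re.search(r'(\w\d+){2}','today date \
is18is18-08-2020'))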

Date{mm-dd-yyyy}: <re.Match object; span=(23, 27), match='2020'>
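
The cells above rely on `{n}`, which repeats the item to its left exactly `n` times. As a minimal sketch, a stricter date pattern can combine `{n}` with `\b` and capture groups. The sample string and the dd-mm-yyyy reading of it are assumptions made for illustration.

```python
import re

text = 'today date is 18-08-2020'

# \d{2} means exactly two digits, \d{4} exactly four.
# The parentheses capture day, month and year separately
# (assuming the sample date is written as dd-mm-yyyy).
match = re.search(r'\b(\d{2})-(\d{2})-(\d{4})\b', text)

if match:
    day, month, year = match.groups()
    print(match.group(), day, month, year)  # prints: 18-08-2020 18 08 2020
```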


## Common regular expression methods in Python
- `re.match()`
- `re.findall()`
- `re.search()`
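
A minimal sketch of how these functions differ, reusing the email pattern from earlier. It also shows `re.match()`, which succeeds only when the pattern matches at the very start of the string. The sample text is made up for illustration.

```python
import re

text = 'purple alice-b@google.com monkey abc@vit.ac.in'
pattern = r'[\w.-]+@[\w.-]+'

# re.search() returns a match object for the first hit, or None.
first = re.search(pattern, text)
print(first.group())  # prints: alice-b@google.com

# re.findall() returns every non-overlapping match as a list of strings.
all_emails = re.findall(pattern, text)
print(all_emails)     # prints: ['alice-b@google.com', 'abc@vit.ac.in']

# re.match() only tries at position 0; the text starts with 'purple ', so it fails.
at_start = re.match(pattern, text)
print(at_start)       # prints: None
```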

**BeautifulSoup**: bs4 - a popular library for extracting content from HTML, XML, etc. It works on the parsed document tree rather than on regular expressions alone.