#### Regular expressions are a powerful language for matching text patterns.

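#### The Python "re" module provides regular expression support.

#### match = re.search(pat, str) searches str for the first match of the pattern pat and returns a match object, or None if there is no match.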

#### The “re” module is included with Python and is used primarily for string searching and manipulation.
#### It is also frequently used for web page “scraping” (extracting large amounts of data from websites).
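A minimal sketch of the re.search call described above. The sample text and the pattern are made up for illustration; the point is only that the result is either a match object or None, so it should be checked before calling group().

```python
import re

# Hypothetical sample text for illustration
text = "an example word:cat!!"
match = re.search(r'word:\w\w\w', text)

# re.search returns a match object on success, or None otherwise
if match:
    print('found', match.group())
else:
    print('did not find')
```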

#### The following table lists the regular expression syntax that is available in Python:


### Pattern Description:

#### ^ ===> Matches beginning of string (and of each line with the re.MULTILINE flag).

#### $ ===> Matches end of string or just before a trailing newline (and end of each line with the re.MULTILINE flag).

#### . ===> Matches any single character except newline. The re.DOTALL (re.S) flag allows it to match newline as well.

#### [...] ===> Matches any single character in brackets.

#### [^...] ===> Matches any single character not in brackets.

#### re* ===> Matches 0 or more occurrences of preceding expression.

#### re+ ===> Matches 1 or more occurrences of preceding expression.

#### re? ===> Matches 0 or 1 occurrence of preceding expression.

#### re{n} ===> Matches exactly n occurrences of preceding expression.

#### re{n,} ===> Matches n or more occurrences of preceding expression.

#### re{n,m} ===> Matches at least n and at most m occurrences of preceding expression.

#### a|b ===> Matches either a or b.

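#### \w ===> Matches word characters (letters, digits, and underscore).
#### \W ===> Matches nonword characters.
#### \s ===> Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v].
#### \S ===> Matches nonwhitespace.
#### \d ===> Matches digits. For ASCII text, equivalent to [0-9].
#### \D ===> Matches nondigits.
#### \A ===> Matches beginning of string.
#### \Z ===> Matches only at the end of the string. Unlike in Perl or Ruby, it does not match just before a trailing newline.
#### \z ===> Matches end of string in Perl and Ruby. In Python, use \Z instead; older Python versions reject \z as an invalid escape.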
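Two entries in the table are easy to get wrong in Python, so here is a small sketch with made-up inputs: "." only crosses a newline when re.DOTALL is passed, and "$" tolerates one trailing newline while "\Z" does not.

```python
import re

# "." does not cross a newline unless re.DOTALL is given
no_flag = re.search(r'a.b', 'a\nb')
with_flag = re.search(r'a.b', 'a\nb', re.DOTALL)

# "$" matches just before a trailing newline; "\Z" only at the very end
dollar = re.search(r'end$', 'the end\n')
strict = re.search(r'end\Z', 'the end\n')

print(no_flag, with_flag is not None)   # None True
print(dollar is not None, strict)       # True None
```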

#### rub[ye] ===> Match "ruby" or "rube"
#### [aeiou] ===> Match any one lowercase vowel
#### [0-9] ===> Match any digit; same as [0123456789]
#### [a-z] ===> Match any lowercase ASCII letter
#### [A-Z] ===> Match any uppercase ASCII letter
#### [a-zA-Z0-9] ===> Match any of the above
#### [^aeiou] ===> Match anything other than a lowercase vowel
#### [^0-9] ===> Match anything other than a digit
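The character classes above can be tried out with a short sketch like the following. The input strings are invented for the example.

```python
import re

# rub[ye] accepts "ruby" and "rube" but not "rubx"
rubies = re.findall(r'rub[ye]', 'ruby rube rubx')

# [aeiou] picks out each lowercase vowel
vowels = re.findall(r'[aeiou]', 'Regular')

# [^0-9] matches every non-digit, so removing those leaves only digits
digits_only = re.sub(r'[^0-9]', '', 'tel: 555-0199')

print(rubies)       # ['ruby', 'rube']
print(vowels)       # ['e', 'u', 'a']
print(digits_only)  # 5550199
```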

#### . ===> Match any character except newline
#### \d ===> Match a digit: [0-9]
#### \D ===> Match a nondigit: [^0-9]
#### \s ===> Match a whitespace character: [ \t\r\n\f]
#### \S ===> Match nonwhitespace: [^ \t\r\n\f]
#### \w ===> Match a single word character: [A-Za-z0-9_]
#### \W ===> Match a nonword character: [^A-Za-z0-9_]
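A brief sketch of the shorthand classes in use, with made-up inputs. Note that "_" counts as a word character while "-" does not.

```python
import re

# \w+ groups letters, digits, and underscores; "-" and spaces break the run
words = re.findall(r'\w+', 'a_b c-d 42')

# \s+ splits on any run of spaces, tabs, or newlines
fields = re.split(r'\s+', 'one  two\tthree\nfour')

# \D+ collects runs of non-digits
non_digits = re.findall(r'\D+', 'ab12cd')

print(words)       # ['a_b', 'c', 'd', '42']
print(fields)      # ['one', 'two', 'three', 'four']
print(non_digits)  # ['ab', 'cd']
```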

#### ruby? ===> Match "rub" or "ruby": the y is optional
#### ruby* ===> Match "rub" plus 0 or more ys
#### ruby+ ===> Match "rub" plus 1 or more ys
#### \d{3} ===> Match exactly 3 digits
#### \d{3,} ===> Match 3 or more digits
#### \d{3,5} ===> Match 3, 4, or 5 digits
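To see how the repetition operators differ, the sketch below runs each one over the same illustrative inputs. Quantifiers are greedy, so each match takes as many characters as the pattern allows.

```python
import re

sample = 'rub ruby rubyy'
optional_y = re.findall(r'ruby?', sample)   # y at most once
any_y = re.findall(r'ruby*', sample)        # zero or more ys
some_y = re.findall(r'ruby+', sample)       # at least one y

numbers = '12 123 1234567'
exactly_3 = re.findall(r'\d{3}', numbers)
at_least_3 = re.findall(r'\d{3,}', numbers)
three_to_5 = re.findall(r'\d{3,5}', numbers)

print(optional_y)  # ['rub', 'ruby', 'ruby']
print(any_y)       # ['rub', 'ruby', 'rubyy']
print(some_y)      # ['ruby', 'rubyy']
print(exactly_3)   # ['123', '123', '456']
print(at_least_3)  # ['123', '1234567']
print(three_to_5)  # ['123', '12345']
```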

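#### ([Pp])ython&\1ails ===> Match python&pails or Python&Pails. The parentheses create group 1, which \1 refers back to.
#### (['"]).*?\1 ===> Single- or double-quoted string. \1 matches whatever the 1st group matched. \2 matches whatever the 2nd group matched, etc. (Backreferences do not work inside [...], where \1 is read as a character code.)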
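Backreferences need a capturing group to refer to. The following sketch, with invented test strings, shows \1 forcing the same letter case and the same quote character on both sides.

```python
import re

case_pattern = r'([Pp])ython&\1ails'
same_upper = re.fullmatch(case_pattern, 'Python&Pails')
same_lower = re.fullmatch(case_pattern, 'python&pails')
mixed = re.fullmatch(case_pattern, 'Python&pails')

# Group 1 captures the opening quote; \1 requires the same closing quote
quoted = re.compile(r"""(['"])(.*?)\1""")
text = """say "hi" and 'bye' now"""
quoted_strings = [m.group(0) for m in quoted.finditer(text)]

print(same_upper is not None, same_lower is not None, mixed)  # True True None
print(quoted_strings)  # ['"hi"', "'bye'"]
```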

#### python|perl ===> Match "python" or "perl"
#### rub(y|le) ===> Match "ruby" or "ruble"
#### Python(!+|\?) ===> "Python" followed by one or more ! or one ?
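Alternation has the lowest precedence of all the operators, which is why the grouping parentheses matter. This sketch uses made-up inputs and non-capturing groups (?:...) so that findall returns the whole match.

```python
import re

# Grouping limits the alternation to the suffix
currencies = re.findall(r'rub(?:y|le)', 'ruby ruble rubber')

# With a group: "Python" followed by !s or a single ?
excited = re.search(r'Python(?:!+|\?)', 'Python!!! rocks').group()

# Without a group the pattern means "Python!+" OR "\?", so a lone "?" matches
lone_question = re.fullmatch(r'Python!+|\?', '?')

print(currencies)                 # ['ruby', 'ruble']
print(excited)                    # Python!!!
print(lone_question is not None)  # True
```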


#### ^Python ===> Match "Python" at the start of a string (or of an internal line with re.MULTILINE)
#### Python$ ===> Match "Python" at the end of a string (or of a line with re.MULTILINE)
#### \APython ===> Match "Python" at the start of a string
#### Python\Z ===> Match "Python" at the end of a string
#### \bPython\b ===> Match "Python" at a word boundary
#### \brub\B ===> \B is a nonword boundary: match "rub" in "rube" and "ruby" but not alone
#### Python(?=!) ===> Match "Python", if followed by an exclamation point.
#### Python(?!!) ===> Match "Python", if not followed by an exclamation point.
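The anchors and lookaheads above match positions rather than characters. A small sketch with illustrative strings; the start offsets show which occurrence was matched.

```python
import re

two_lines = 'Python 2\nPython 3'
default_starts = re.findall(r'^Python', two_lines)
multiline_starts = re.findall(r'^Python', two_lines, re.MULTILINE)

# "rub" must be followed by another word character, so the lone "rub" is skipped
rub_offsets = [m.start() for m in re.finditer(r'\brub\B', 'rub ruby rube')]

text = 'Python! Python?'
followed_by_bang = [m.start() for m in re.finditer(r'Python(?=!)', text)]
not_followed_by_bang = [m.start() for m in re.finditer(r'Python(?!!)', text)]

print(default_starts)        # ['Python']
print(multiline_starts)      # ['Python', 'Python']
print(rub_offsets)           # [4, 9]
print(followed_by_bang)      # [0]
print(not_followed_by_bang)  # [8]
```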



In [1]:
import re
string = "Virat Kohli is one of the greatest players in the Indian cricket team.nHe was born on November 5, 1988, in Delhi.nHe has completed his education at Vishal Bharti School.nIn 2008, he won the World Cup for India on Omar’s children under 19 years. From 2011, he started Test cricket matches. nHe is currently the captain of all three formats of India.n In 2017, Virat Kohli got married to Hindi film actress Anushka Sharma.nVirat has won the Man of the Tour twice, in 2014 and 2016. nSince 2008, he has represented Delhi in-home teams. nHe has been awarded the Arjuna Award in recognition of the achievements of international cricket."

In [2]:
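# "V" at the start, then as few characters as possible (lazy .+?) up to the first "s"
pattern = r'(^[V].+?)s'
print(re.match(pattern, string))          # Returns the match object
print(re.match(pattern, string).group())  # The matched text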

<re.Match object; span=(0, 14), match='Virat Kohli is'>
Virat Kohli is
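re.match only tries the pattern at the very start of the string, while re.search scans forward for the first match. A minimal sketch of the difference, using a shortened sample sentence:

```python
import re

sentence = 'Virat Kohli is a cricketer'

at_start = re.match(r'Kohli', sentence)    # "Kohli" is not at position 0
anywhere = re.search(r'Kohli', sentence)   # found further along the string

print(at_start)         # None
print(anywhere.span())  # (6, 11)
```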


In [3]:
pattern=r'[0-9]+'
re.search(pattern,string)      # Returns the match object
print(re.search(pattern,string).group())

5


In [4]:
pattern=r'[0-9]+'
print(re.findall(pattern,string))

['5', '1988', '2008', '19', '2011', '2017', '2014', '2016', '2008']
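re.findall returns only the matched strings. When the positions are needed too, re.finditer yields match objects instead. A short sketch on a shortened sample sentence:

```python
import re

sentence = 'born on November 5, 1988'

# Each match object carries both the text and its offset in the string
found = [(m.group(), m.start()) for m in re.finditer(r'[0-9]+', sentence)]

print(found)  # [('5', 17), ('1988', 20)]
```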


In [5]:
pattern= 'Virat'
repl = 'Chiku'
print(re.sub(pattern, repl, string))

Chiku Kohli is one of the greatest players in the Indian cricket team.nHe was born on November 5, 1988, in Delhi.nHe has completed his education at Vishal Bharti School.nIn 2008, he won the World Cup for India on Omar’s children under 19 years. From 2011, he started Test cricket matches. nHe is currently the captain of all three formats of India.n In 2017, Chiku Kohli got married to Hindi film actress Anushka Sharma.nChiku has won the Man of the Tour twice, in 2014 and 2016. nSince 2008, he has represented Delhi in-home teams. nHe has been awarded the Arjuna Award in recognition of the achievements of international cricket.
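By default re.sub replaces every occurrence. The sketch below, on a short made-up sentence, limits the number of replacements with the count argument and uses re.subn to find out how many were made.

```python
import re

sentence = 'Virat Kohli; Virat again'

first_only = re.sub('Virat', 'Chiku', sentence, count=1)
replaced, how_many = re.subn('Virat', 'Chiku', sentence)

print(first_only)          # Chiku Kohli; Virat again
print(replaced, how_many)  # Chiku Kohli; Chiku again 2
```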


In [6]:
import re
xx = "programokey99,education is fun"
r1 = re.findall(r"^\w+",xx)
print(r1)

['programokey99']


In [7]:
import re
xx = "programokey,education is a fun"
r1 = re.search(r"(\ba)", xx)   # first "a" that starts a word: the standalone "a"
print(r1.span())
print(re.split(r'\s', 'we are splitting the words'))
print(re.split(r's', 'split the words'))

(25, 26)
['we', 'are', 'splitting', 'the', 'words']
['', 'plit the word', '']
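Two further re.split behaviours are worth knowing, sketched here with the same kind of input: maxsplit caps the number of splits, and a capturing group in the pattern keeps the separators in the result.

```python
import re

# Stop after two splits; the rest of the string stays in one piece
limited = re.split(r'\s', 'we are splitting the words', maxsplit=2)

# A capturing group around the separator includes it in the output
with_separators = re.split(r'(\s)', 'a b')

print(limited)          # ['we', 'are', 'splitting the words']
print(with_separators)  # ['a', ' ', 'b']
```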


In [8]:
import re

words = ["guru99 get", "guru99 give", "guru Selenium"]
for element in words:
    # Two words that both start with "g", separated by one non-word character
    z = re.match(r"(g\w+)\W(g\w+)", element)
    if z:
        print(z.groups())

patterns = ['software testing', 'guru99']
text = 'software testing is fun?'
for pattern in patterns:
    print('Looking for "%s" in "%s" ->' % (pattern, text), end=' ')
    if re.search(pattern, text):
        print('found a match!')
    else:
        print('no match')

abc = 'guru99@google.com, careerguru99@hotmail.com, @yahoomail.com'
# "@yahoomail.com" is skipped: at least one character is required before "@"
emails = re.findall(r'[\w\.-]+@[\w\.-]+', abc)
for email in emails:
    print(email)

('guru99', 'get')
('guru99', 'give')
Looking for "software testing" in "software testing is fun?" -> found a match!
Looking for "guru99" in "software testing is fun?" -> no match
guru99@google.com
careerguru99@hotmail.com


In [9]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
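Because quantifiers go as far as possible, a pattern such as <.*> can swallow more than intended. Adding ? after the quantifier makes it lazy, so it stops at the first point where the rest of the pattern can match. A sketch on an illustrative snippet of markup:

```python
import re

markup = '<b>foo</b> and <i>so on</i>'

# Greedy: .* runs to the last ">" in the string
greedy = re.search(r'<.*>', markup).group()

# Lazy: .*? stops at the first ">" each time
lazy = re.findall(r'<.*?>', markup)

print(greedy)  # <b>foo</b> and <i>so on</i>
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```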

In [10]:
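text = 'purple alice-b@google.com monkey dishwasher'
# \w does not match "-" or ".", so the match is cut short on both sides
match = re.search(r'\w+@\w+', text)
if match:
    print(match.group())  ## 'b@google'

b@google


In [11]:
# Adding "." and "-" to the character sets captures the whole address
match = re.search(r'[\w.-]+@[\w.-]+', text)
if match:
    print(match.group())  ## 'alice-b@google.com'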

alice-b@google.com


In [12]:
text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com
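Numbered groups can be hard to keep track of in longer patterns. As a variation on the cell above, groups can be given names with (?P<name>...). The names "user" and "host" are chosen here purely for illustration.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)', text)

if match:
    print(match.group('user'))  # alice-b
    print(match.group('host'))  # google.com
    print(match.groupdict())    # {'user': 'alice-b', 'host': 'google.com'}
```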


In [13]:
## Suppose we have a text with many email addresses
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)

alice@google.com
bob@abc.com


In [14]:
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## With groups in the pattern, re.findall() returns a list of tuples
pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
print(pairs)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for pair in pairs:
    print(pair[0])  ## username
    print(pair[1])  ## host

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com


In [15]:
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, text) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', text))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher

purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
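The replacement passed to re.sub can also be a function that receives each match object and returns the replacement text. The sketch below is one possible use, with a precompiled pattern; the names EMAIL_RE and mask_user are hypothetical and not part of the re module.

```python
import re

# Compile once when the same pattern is reused several times
EMAIL_RE = re.compile(r'([\w.-]+)@([\w.-]+)')
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

def mask_user(match):
    # Keep the first character of the username and the full host
    user, host = match.group(1), match.group(2)
    return user[0] + '***@' + host

masked = EMAIL_RE.sub(mask_user, text)
print(masked)  # purple a***@google.com, blah monkey b***@abc.com blah dishwasher
```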
