### Regex - Regular Expressions

What we will cover:

1. What are regular expressions? How are they different from the string find or index methods?
2. Why do we use r'' (raw strings) when creating our patterns? The \ escape character
3. Some functions in the re library: findall, finditer, search, match, compile
4. Match objects and methods on match objects
5. Special sequences
6. Metacharacters
7. Quantifiers
8. Flags
9. Some more regular expression functions

In [1]:
''' Yashoda is the VP of Marketing     yashoda@gmail.com
noorjahan@yahoo.com CEO of XYZ company
abhijit@outlook.com  Owns his own startup in the AI space
arun@hotmail.com   jkagakjg
a;iugagj   afrin@rediffmail.com'''


' Yashoda is the VP of Marketing     yashoda@gmail.com\nnoorjahan@yahoo.com CEO of XYZ company\nabhijit@outlook.com  Owns his own startup in the AI space\narun@hotmail.com   jkagakjg\na;iugagj   afrin@rediffmail.com'

In [None]:
# We often need to extract specific information from text data. For example, we may want to know how many people
# contacted us through Gmail in the last month, the phone numbers of employees in a company whose names start with
# 'A', or the dates of birth of patients in a hospital who were admitted for hypertension treatment. To get such
# information, we have to search the text data. Once the required information is found, we may have to process it
# further. Regular expressions are useful for this kind of searching and processing.
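
As a first taste of the kind of extraction described above, the sketch below pulls the e-mail addresses out of the sample text from the first cell. The variable names and the simplified address pattern are assumptions made for illustration; real e-mail validation is considerably more involved.

```python
import re

# Sample text copied from the first cell of this notebook
contacts = ''' Yashoda is the VP of Marketing     yashoda@gmail.com
noorjahan@yahoo.com CEO of XYZ company
abhijit@outlook.com  Owns his own startup in the AI space
arun@hotmail.com   jkagakjg
a;iugagj   afrin@rediffmail.com'''

# Simplified pattern: word characters, '@', word characters, '.', word characters
all_emails = re.findall(r'\w+@\w+\.\w+', contacts)

# Narrow the pattern to pick up only Gmail addresses
gmail_emails = re.findall(r'\w+@gmail\.com', contacts)

print(all_emails)
print(len(gmail_emails))
```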

In [1]:
# Regular Expressions
# A regular expression is a string that describes a pattern, using special symbols and characters, to find and extract
# the information we need from the given data. 

# A plain string method that searches for a substring looks like this:

input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to make\
the bitter butter better'

sub_str = 'bought'

input_str.find(sub_str)


6

In [2]:
import re

sub_re = r'b\w+'

result = re.findall(sub_re, input_str)
print(result)

['bought', 'butter', 'but', 'butter', 'bitter', 'bought', 'better', 'butter', 'bitter', 'butter', 'better']


In [None]:
print(len(result))

# findall returns every non-overlapping match of the pattern: here, every lowercase 'b' followed by one or more word characters. 
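
To make the contrast with the string find method concrete, here is a small sketch. The short sentence and the variable names are made up for the example.

```python
import re

text = 'Betty bought some butter'

# str.find needs the exact substring and reports only the first position
first_pos = text.find('bought')

# A regex describes a shape: lowercase 'b' followed by one or more word characters
b_words = re.findall(r'b\w+', text)

print(first_pos)
print(b_words)
```

The find call answers "where is this exact text?", while the regex answers "which pieces of text look like this?".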

In [3]:
print(dir(re))

['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'Match', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pickle', '_special_chars_map', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']


In [8]:
sub_re = re.compile(r'b\w+')

result = re.findall(sub_re, input_str)
print(result)

['bought', 'butter', 'but', 'butter', 'bitter', 'bought', 'better', 'butter', 'bitter', 'butter', 'better']


In [7]:
print(type(sub_re))

<class 're.Pattern'>


In [9]:
result2 = sub_re.findall(input_str)

print(result2)

['bought', 'butter', 'but', 'butter', 'bitter', 'bought', 'better', 'butter', 'bitter', 'butter', 'better']


In [10]:
# search, finditer and match all produce 'match objects' (finditer yields them one by one)

# search finds only the first occurrence of the pattern

In [11]:
print(input_str)

Betty bought some butter but the butter was bitter so Betty bought some better butter to makethe bitter butter better


In [12]:
sub_re = r'b\w+'

result = re.search(sub_re, input_str)

print(result)

<re.Match object; span=(6, 12), match='bought'>


In [13]:
# Attributes/methods of a match object. 

print(result.span())

(6, 12)


In [14]:
print(result.start())

6


In [15]:
print(result.end())

12


In [16]:
print(result.string)

Betty bought some butter but the butter was bitter so Betty bought some better butter to makethe bitter butter better


In [18]:
print(result.group())

bought


In [17]:
print(dir(result))

['__class__', '__class_getitem__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']


In [None]:
# match method will only find the match at the beginning of the string
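
A minimal sketch of the difference between match and search, using a shortened version of the sentence above. Both functions return None when nothing is found, so it is safer to check before calling methods on the result. The variable names are illustrative.

```python
import re

text = 'Betty bought some butter'

at_start = re.match(r'b\w+', text)     # None: the string starts with 'B', not 'b'
anywhere = re.search(r'b\w+', text)    # first match anywhere in the string

# Both functions return None when nothing matches, so check before calling methods
if at_start is None:
    print('match found nothing at the start')
if anywhere is not None:
    print(anywhere.group(), anywhere.span())
```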

In [19]:
print(input_str)

Betty bought some butter but the butter was bitter so Betty bought some better butter to makethe bitter butter better


In [25]:
input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to makethe bitter butter better'

sub_re = r'B\w+'

result = re.match(sub_re, input_str)

print(result)

print(result.span())
print(result.group())

<re.Match object; span=(0, 5), match='Betty'>
(0, 5)
Betty


In [33]:
# finditer method - finds ALL the matches of the pattern but returns an iterator that yields the match objects.

input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to makethe bitter butter better'

sub_re = r'b\w+'

result = re.finditer(sub_re, input_str)

print(result)


<callable_iterator object at 0x000001F74BD511F0>


In [32]:
for everymatch in result:
    print(everymatch.start(), everymatch.group())
    

6 bought
18 butter
25 but
33 butter
44 bitter
60 bought
72 better
79 butter
97 bitter
104 butter
111 better


In [35]:
print(input_str)

Betty bought some butter but the butter was bitter so Betty bought some better butter to makethe bitter butter better


In [42]:
input_str = 'Betty bought some butterbut the butter was bitter so Betty bought some better butter to makethe bitter butter better'

sub_re = r'b\w+'

result = re.findall(sub_re, input_str)

print(result)

['bought', 'butterbut', 'butter', 'bitter', 'bought', 'better', 'butter', 'bitter', 'butter', 'better']


In [None]:
# Regex scans the string character by character from left to right, and its quantifiers perform a 'greedy' match. 
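
The sketch below illustrates what 'greedy' means on the word 'butterbut' from the cell above. For contrast it also uses the lazy form `+?`, which is not covered in this section and appears here only as an assumption-free standard feature of the re module to show the opposite behaviour.

```python
import re

text = 'butterbut'

# Greedy: \w+ keeps consuming word characters for as long as it can
greedy = re.findall(r'b\w+', text)

# Lazy: \w+? stops as soon as the pattern is satisfied (one word character)
lazy = re.findall(r'b\w+?', text)

print(greedy)
print(lazy)
```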

In [None]:
# However, if we wanted to find all the words beginning with b - whether small or capital - using only string methods,
# we would have to do some manipulations to get the desired result. Regex gives us tools to handle these queries and
# operations in a much simpler manner. 

input_str = 'Betty bought some butterbut the butter was bitter so Betty bought some better butter to make\
the bitter butter better'

sub_re = r'[bB]\w+'
result = re.findall(sub_re, input_str)
print(result)
print(len(result))


In [None]:
# Note here how the words with a capital B were also returned in the result, because the character class [bB] matches
# either case. We shall see the other available functions in the re module shortly.
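
Two common ways to match both 'b' and 'B' are sketched below: a character class, and the IGNORECASE flag from the re module. The short sentence is made up for the example.

```python
import re

text = 'Betty bought some butter'

# Option 1: a character class that accepts either case for the first letter
with_class = re.findall(r'[bB]\w+', text)

# Option 2: the IGNORECASE flag, which makes the whole pattern case-insensitive
with_flag = re.findall(r'b\w+', text, re.IGNORECASE)

print(with_class)
print(with_flag)
```

The character class affects only one position in the pattern, while the flag applies to every letter in it.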

In [None]:
# A regular expression helps us search, match, find and split text based on patterns that we specify.
# A regular expression is also simply called a regex. Regular expressions are available in many languages
# besides Python. 


# Python provides the re module, whose name stands for regular expressions. This module contains functions
# like compile(), search(), match(), findall(), split(), etc. which are used to find information in
# the available data. So, when we write a regular expression, we should import the re module as:

import re

#### The re module has several methods to help us write regex. 

search - returns a match object if the substring is matched in the string to be searched. It returns only the first
occurrence of the match.

findall - returns a list containing all matches

split - returns a list where string has been split at each pattern match. 

sub - replaces one or many pattern matches with a specified string. 

As well as other methods which we shall see in a bit.
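
A short sketch of split and sub, the two functions from the list above that are not demonstrated elsewhere in this section. The sentence and the replacement word 'jam' are made up for the example.

```python
import re

text = 'Betty bought butter, but the butter was bitter'

# split: break the string wherever the pattern matches (commas and/or whitespace here)
words = re.split(r'[,\s]+', text)

# sub: replace every match of the pattern with the given string
replaced = re.sub(r'b\w+tter', 'jam', text)

print(words)
print(replaced)
```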

In [None]:
import re

input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to make\
the bitter butter better'

sub_re = r'[bB]\w+'
result = re.findall(sub_re, input_str)


In [None]:
print(result)

In [None]:
#compile method

In [None]:
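# compile turns the pattern string into a reusable pattern object
sub_re = re.compile(r'[bB]\w+')

print(sub_re)

result = re.findall(sub_re, input_str)

print(result)

In [None]:
# A compiled pattern has its own findall, search, match and finditer methods
result = sub_re.findall(input_str)

print(result)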

In [None]:
print(input_str)

In [None]:
input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to make the bitter butter better.'

In [None]:
sub_re = r'\bt\w+'

# search (unlike findall) returns a match object, which has span(), group(), start() and end()
searchobj = re.search(sub_re, input_str)

print(searchobj)

In [None]:
print(searchobj.span())

In [None]:
print(searchobj.group())

In [None]:
print(searchobj.start())
print(searchobj.end())

In [None]:
print(searchobj.string)

In [None]:
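sub_re = r'b\w+'

matchobj = re.match(sub_re, input_str)

# Prints None: match only looks at the beginning of the string, which starts with 'B', not 'b'
print(matchobj)

In [None]:
# Calling .span() on None would raise an AttributeError, so check first
if matchobj is not None:
    print(matchobj.span())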

In [None]:
finditerobj = list(re.finditer(sub_re, input_str))

print(finditerobj)

In [None]:
input_str[6:12]

In [None]:
for each in finditerobj:
    print(each.group())

In [None]:
# Going forward, it is important to remember that the regex engine works character by character from left to right:
# it keeps extending a match for as long as the conditions of the pattern continue to be satisfied, and only then
# starts looking for the next match. You shall see examples of this later in the class.

In [None]:
# * - 0 to infinity occurrences of the preceding sequence or character
# + - 1 to infinity occurrences of the preceding sequence or character

In [None]:
input_str = 'a b bx bxx bxyz'

sub_str = r'b\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# * = Whatever character precedes the star - look for 0 to infinity occurrences
# + = Whatever character precedes the plus - look for 1 to infinity occurrences


In [None]:
input_str = 'buuuuuutter btter butter'

sub_str = r'bu+tter'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
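# \A and ^ both match at the beginning of the string (with the MULTILINE flag, ^ also matches at the start of each line)

# \Z and $ both match at the end of the string ($ also matches just before a trailing newline)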
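
The anchors behave the same in the simple case but differ in two situations, sketched below: with the MULTILINE flag, and when the string ends with a newline. The sample strings and variable names are made up for the example.

```python
import re

text = 'Line 1\nHello there\nMy name is Khan'

# Without flags, ^ and \A both match only at the very start of the string
default_caret = re.findall(r'^\w+', text)

# With MULTILINE, ^ also matches at the start of every line; \A does not change
multiline_caret = re.findall(r'^\w+', text, re.MULTILINE)
multiline_A = re.findall(r'\A\w+', text, re.MULTILINE)

# $ also matches just before a trailing newline; \Z matches only at the true end
dollar_end = re.findall(r'\w+$', 'Khan\n')
Z_end = re.findall(r'\w+\Z', 'Khan\n')

print(default_caret, multiline_caret, multiline_A)
print(dollar_end, Z_end)
```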




In [None]:
# List of special sequences. A special sequence is a \ followed by one of the characters from the list below, and each
# special sequence has a special meaning.

# Special Sequence             Description
# \A                           Matches only at the start of the string
# \b                           Matches at a word boundary, i.e. at the beginning or end of a word (\b before a pattern checks
#                              that a word begins with it and \b after a pattern checks that a word ends with it).
# \B                           The opposite of \b: matches only where there is NO word boundary.
# \d                           Matches any decimal digit; for ASCII text this is equivalent to the set class [0-9]
# \D                           Matches any non-digit character; for ASCII text this is equivalent to the set class [^0-9]
# \s                           Matches any whitespace character (space, tab, newline, etc.).
# \S                           Matches any non-whitespace character
# \w                           Matches any word character (letters, digits and underscore); for ASCII text this is
#                              equivalent to the class [a-zA-Z0-9_].
# \W                           Matches any non-word character.
# \Z                           Matches only at the end of the string

In [None]:
# * - 0 or more occurrences of the character or sequence preceding it

# a* = 0 or more occurrences of a
# \w* = 0 or more occurrences of a \w word character

# + - 1 or more occurrences of the character or sequence preceding it.

# a+ - 1 or more occurrences of a
# \w+ - 1 or more occurrences of a word character.

In [52]:
import re

input_str = 'Betty b0ught some butt3r but the bu!tb3rrrr w@s b!tbt3r s0 Betty b0u6ht s0me Bett3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r'

In [53]:
sub_re = r'\AB\w+'

result = re.findall(sub_re, input_str)

print(result)

['Betty']


In [55]:
sub_str = r'\w+r\Z'

result = re.findall(sub_str, input_str)

print(result)

['b3tt3r']


In [59]:
sub_str = r'\w+t\b'

result = re.findall(sub_str,input_str)

print(result)

['b0ught', 'but', 'b0u6ht']


In [60]:
#\B - Matches at a position that is NOT a word boundary, i.e. the pattern must be found BUT not at the beginning
#     (or end) of a word 

# IT DOES NOT MEAN

# Match any word where the given pattern is not at the beginning. 

In [61]:
print(input_str)

Betty b0ught some butt3r but the bu!tb3rrrr w@s b!tbt3r s0 Betty b0u6ht s0me Bett3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r


In [None]:
# \b is either the beginning or the end of a word
# \B is NOT at the beginning or end of a word
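
A small sketch contrasting \b and \B on the same letters. The three-word sample string is made up so that 'but' appears at the start of a word, inside a word, and as a whole word.

```python
import re

text = 'butter rebut but'

# 'but' right after a word boundary (start of a word)
starts_with_but = re.findall(r'\bbut\w*', text)

# 'but' NOT right after a word boundary (somewhere inside a word)
inside_word = re.findall(r'\Bbut\w*', text)

# 'but' with a boundary on both sides (a complete word)
whole_word = re.findall(r'\bbut\b', text)

print(starts_with_but)
print(inside_word)
print(whole_word)
```

Note that `inside_word` returns only the matched letters from 'rebut', not the whole word: \B constrains where the match starts, it does not select words.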

In [63]:
input_str = 'Betty b0ught some butt3r but the bu!tb3rrrr w@s b!tbt3r s0 Betty b0u6ht s0me Bett3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r'

sub_re = r'\w+t\B'

result = re.findall(sub_re, input_str)

print(result)

['Bett', 'butt', 'tbt', 'Bett', 'Bett', 'butt', 'tt3rbutt', 'b3tt']


In [None]:
# \d - any digit (0-9), \D - any non-digit

In [64]:
sub_re = r'b\d\w+'

result = re.findall(sub_re, input_str)

print(result)

['b0ught', 'b3rrrr', 'b0u6ht', 'b3tt3r']


In [65]:
sub_re = r'\w+\D'

result = re.findall(sub_re, input_str)

print(result)

['Betty ', 'b0ught ', 'some ', 'butt3r ', 'but ', 'the ', 'bu!', 'tb3rrrr ', 'w@', 's ', 'b!', 'tbt3r ', 's0 ', 'Betty ', 'b0u6ht ', 's0me ', 'Bett3r ', 'butt3r ', 'to ', 'm@', 'k3 ', 'the ', 'b!', 'tt3rbutt3r ', 'b3tt3r']


In [66]:
print(input_str)

Betty b0ught some butt3r but the bu!tb3rrrr w@s b!tbt3r s0 Betty b0u6ht s0me Bett3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r


In [67]:
input_str = 'Aayush Nanda CEO, Sankara Narayan VP Finance, Raja Imitha CTO, Harsha Singh Founder/Owner, Sravanthi Reddy CFO'

In [72]:
sub_re = r'\w+\s\w+'

result = re.findall(sub_re, input_str)

print(result)

['Aayush Nanda', 'Sankara Narayan', 'VP Finance', 'Raja Imitha', 'Harsha Singh', 'Sravanthi Reddy']


In [None]:
# \w - Alphanumeric

In [73]:
# \W - Non-Alphanumeric


input_str = 'Betty b0ught some butt3r but the bu!tb3rrrr w@s b!tbt3r s0 Betty b0u6ht s0me Bett3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r'
sub_re = r'\W+'

result = re.findall(sub_re, input_str)

print(result)

[' ', ' ', ' ', ' ', ' ', ' ', '!', ' ', '@', ' ', '!', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '@', ' ', ' ', '!', ' ']


In [None]:
input_str1 = '''Line 1
Hello there
My name is Khan'''

templist = input_str1.splitlines()

print(templist)

In [None]:
for x in templist:
    print(re.findall(r'\A\w+', x))

In [None]:
sub_str = r'\A\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
print(input_str)

In [None]:
input_str = '''Betty b0ught some butt3r but the bu!tb3rrrr w@s b!tbt3r s0 Betty b0u6ht s0me Bett3r butt3r to m@k3
the b!tt3r butter better'''


In [None]:
sub_str = r'\B'

result = re.findall(sub_str, input_str)

In [None]:
# \B - Means the pattern should be matched BUT not at the beginning (or end) of a word 

# IT DOES NOT MEAN

# Match any word where the given pattern is not at the beginning. 

In [None]:
print(result)

In [None]:
print(input_str)

In [None]:
input_str = '''Betty b0ught some butst3r but the butb3rrrr w@s b!tbt3r s0 Betty b0u6ht s0me Bett3r butt3r to m@k3
the b!tt3r butter better'''

In [None]:
sub_str = r'\Bb\w+'

result = re.findall(sub_str, input_str)
print(result)

In [None]:
sub_str = r'\W\w+'
result = re.findall(sub_str, input_str)

print(result)

In [None]:
print(input_str)

In [None]:
#\B - Pattern should be matched but should NOT be at beginning of the word

#\B - DOES NOT mean - any word that does not have the pattern at the beginning of the word.

In [None]:
print(input_str)

In [None]:
input_str = '''Betty b0ught somebody butst3r but the butb3rrrr w@s b!tbt3r s0 Betty b0u6ht s0me Bett3r butt3r to m@k3
the b!tt3r butter better'''

In [None]:
sub_str = r'\Bb\w+'

result = re.findall(sub_str, input_str)
print(result)

# The findall function takes two parameters: the pattern and the string to be searched. It returns the matches in a list 
# in the order they are found. If no matches are found, it returns an empty list.

In [None]:
print(input_str)

In [None]:
sub_str = r'\D+'

result = re.findall(sub_str, input_str)
print(result)

In [None]:
print(input_str)

In [None]:
#\A Returns a match if the specified characters are at the beginning of the string(NOT words but the whole string)
sub_str = r'\AB\w+'

print(re.findall(sub_str, input_str))

In [None]:
input_str = r'''Betty b0u6ht 5ome cutt3r but the betbt3r w@s \n xyzb tbt3r s0 Betty!b0u6ht \n s0me Bett3r butt3r to m@k3 the b!tt3r  
butt3r b3tter'''

In [None]:
print(input_str)

In [None]:
input_str = 'The Terminator The Avengers The Invincibles The Minions Lion King'

In [None]:
sub_str = r'\W+'


result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'y\Sb0u'

print(type(sub_str))

result = re.findall(sub_str, input_str)
      
print(result)

In [None]:
print(input_str)
print(sub_str)

In [None]:
result = re.search(sub_str, input_str)

print(result)

print(dir(result))

In [None]:
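# A single match object is not iterable; finditer yields one match object per match
for i in re.finditer(sub_str, input_str):
    print(i.span())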

In [None]:
# Note how we started putting r (for raw string) before the regex. This is because regex uses \ in front of
# many shorthand notations, while \ is also the escape character in Python strings. To avoid the conflict, we always
# write the regular expressions to be searched as raw strings. 
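
The conflict is easiest to see with \b, which Python itself interprets as a backspace character in a normal string. A minimal sketch, with made-up variable names:

```python
import re

plain = '\bbut'     # Python turns \b into a single backspace character
raw = r'\bbut'      # the backslash and the 'b' are kept as two characters

print(len(plain), len(raw))

text = 'butter but'

# The plain pattern looks for a backspace followed by 'but', which is not in the text
plain_result = re.findall(plain, text)

# The raw pattern looks for 'but' at a word boundary
raw_result = re.findall(raw, text)

print(plain_result)
print(raw_result)
```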

In [None]:
input_str = 'betty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me bett3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r'

sub_str = '\w+3r\Z'

result = re.findall(sub_str, input_str)

print(result)
#print(result.group())

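# Note the result when the r prefix is left out: this pattern still works, because \w and \Z are not Python string
# escapes (newer Python versions warn about them). A pattern containing \b would break, since '\b' in a normal string
# is the backspace character.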

In [None]:
#Workaround: without the r prefix, escape each backslash so that it reaches the regex engine

sub_str = '\\bbu\\w+'

result = re.search(sub_str, input_str)

print(result)
print(result.group())

In [None]:
#Easiest way

sub_str = r'\bbu\w+'

result = re.search(sub_str, input_str)

print(result)
print(result.group())

In [None]:
print(input_str)

In [None]:
input_str = 'betty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me bett3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r'

sub_str = r'\bbu\w+'


result = re.findall(sub_str, input_str)

print(result)

In [None]:
result = re.search(sub_str, input_str)

print(result)

In [None]:
print(result.span())

In [None]:
print(result.start())

In [None]:
print(result.end())

In [None]:
print(result.group())

In [None]:
print(result.string)

In [None]:
sub_str = r'be\w+'

In [None]:
print(input_str)

In [None]:
input_str = 'betty b0bu6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me bett3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r'

In [None]:
sub_str = r'\bbu\w+'

In [None]:
result = re.match(sub_str, input_str)
result1 = re.search(sub_str, input_str)

print(result)
print(result1)

In [None]:
result = re.finditer(sub_str, input_str)

print(result)

In [None]:
for match in result:
    print(match.span(), ' ', match.group())

In [None]:
input_str = 'Betty b0u6ht some butt3r but the bu!tb3rrrr w@s b!tbt3r s0 Betty b0u6ht s0me Bett3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

In [None]:
# \B - Match this pattern BUT not at the beginning (or end) of a word. 
# \B - does NOT mean ------> match anything that does not have this pattern.

In [None]:
sub_str = r'\D+3r.\Z'

result = re.findall(sub_str, input_str)
print(result)

In [None]:
sub_str = re.compile(r'\bbu\w+')

print(sub_str)

result = re.findall(re.compile(r'\bbu\w+'), input_str)

print(result)

In [None]:
result = sub_str.search(input_str)

print(result)

In [None]:
result = re.finditer(sub_str, input_str)

print(result)

In [None]:
for x in result:
    print(x.group(), x.span())

In [None]:
#\b Returns a match if the specified characters are at the beginning or end of a word. 


input_str = 'B3tty 3b0u6ht 5ome butt3r but the butt3r w@s bitt3r s0 Betty b0u6ht s0me b3tt3r butt3r to mak3 the bitt3r \
butt3r b3tt3r'

sub_str = r'\B\d\w+'

result = re.findall(sub_str, input_str)
print(result)

In [None]:
result = re.findall(sub_str, input_str)
print(result)

In [None]:
import re

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

print(input_str)

sub_str = r'\b\w+t\w*'

result = re.findall(sub_str, input_str)
print(result)

In [None]:
sub_str = r'\bt\w+'

result = re.findall(sub_str, input_str)
print(result)

In [None]:
print(input_str.index('the'))

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\bb0u6\w*'

result = re.search(sub_str, input_str)
print(result)

In [None]:
print(input_str.index('b0u6'))

In [None]:
print(result.group())

In [None]:
print(result.span())

In [None]:
print(result.start())

In [None]:
print(result.end())

In [None]:
print(result.string)

In [None]:
result = re.finditer(sub_str, input_str)

print(result)

In [None]:
for x in result:
    print(x.span(), x.group())

In [None]:
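# The finditer result above is an exhausted iterator, so search again to get a single match object
result = re.search(sub_str, input_str)

print(f'Found match {result.group()} beginning at {result.start()} and ending at {result.end()} and span is {result.span()}.')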

In [None]:
# The match object returned from the search function has the following methods and attributes to retrieve the information:

# .span() - returns the beginning and end index numbers of the matched string in a tuple. 
# .string - attribute holding the string passed into the function to be searched. 
# .group() - returns the part of the string where there was a match. 
# .start() - returns the start index
# .end() - returns the end index (one past the last matched character)

In [None]:
print(result.span())

In [None]:
print(result.start())

In [None]:
print(result.end())

In [None]:
print(result.string)

In [None]:
print(result.group())

In [None]:
#Finding all match objects for a pattern using finditer

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'but\w+'

result = re.finditer(sub_str, input_str)

print(result)

In [None]:
for x in result:
    print('-'*100)
    print(x)
    print(f'Found match {x.group()} beginning at {x.start()} and ending at {x.end()}.')

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\w+e\b'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
#\w matches any word character - upper and lower case letters, digits 0 to 9 and the _
# underscore. 
# + (outside square brackets) is a quantifier specifying 1 or more occurrences. 

# So in the pattern r'\bbu\w+' we specified:

# r - this is a raw string - Python does not treat \ as an escape character. 
# '' - quotes denoting strings.
# \b - a word boundary, so the match must start at the beginning of a word
# bu - characters to search - so, the pattern we are looking for begins with 'bu'
# \w - After 'bu' search for any word character
# + - One or more occurrences of a word character. 

# So, in summary: 

# Search for a word in the string which begins with 'bu' and has one or more word characters after 'bu'. Note that
# it won't catch a bare 'bu' if that occurred in the input string, because + requires at least one more character.

In [None]:
input_str = 'B3tty b0u6ht some bu but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\w\d\W'
result = re.findall(sub_str, input_str)

print(result)

In [None]:
input_str = 'B3tty b0u6ht some the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\bbu\w+'
result = re.search(sub_str, input_str)

print(result.group())
print(result.span())

In [None]:
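# Changing the + to * lets the pattern match a bare 'bu' as well, because * allows zero word characters after 'bu'.
# The input needs a standalone 'bu' for this to show, so we reuse the earlier string that contains one.

input_str = 'B3tty b0u6ht some bu but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\bbu\w*'

result = re.search(sub_str, input_str)

print(result.group())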

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\bbu\w*'

result = re.search(sub_str, input_str)

print(result.group())

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\w+r\b'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# \B Returns a match where the specified pattern is NOT at the beginning or end of a word.


input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

print(input_str)

In [None]:
#print(input_str.index('b!tt3rbutt3r'))

In [None]:
sub_str = r'\B'

result = re.findall(sub_str, input_str)

print(result)



In [None]:
input_str = 'B3tty b0u6ht some butt3r but hello therefore butt3r w@s b!tt3r s0 hermit Betty b0u6ht helium s0me b3tt3r butt3r \
to m@k3 tehe b!tt3r butt3r b3tt3r.'

sub_str = r't\w*he\B'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
import re

input_str = 'B3tty b0u6ht  some butt3r but the butt3r w@s b!tt3r s0 Bettyb0u6ht s0me b3tt3rbutt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

sub_str = r'\Bb\w+'

result = re.finditer(sub_str, input_str)

for x in result:
    print(x.span(), x.group())
    print(input_str[x.start()-5:x.end()+5])


In [None]:
sub_str = r'\w+ht\w*'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

In [None]:
print(input_str)

In [None]:
name1 = 'My name is Anthony Gonsalvez. Roop nagar, Prem Galli kholi no. 420. Excuse me please'

sub_str = r'[a-zA-Z]*'

result = re.findall(sub_str, name1)

print(result)

In [None]:
print(input_str)

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s 3b!tt3r 50 betty b0u6ht 5ome b3tt3r butt3r to m@k3 the b!tt3r butt3r b3tt3r.'

In [None]:
input_str = 'Rishabh! 8141980914 Neha 0918410984 Rinkey Pal 091490148'

In [None]:
#\d Returns a match where the string contains digits. \D (used below) matches the non-digit characters instead.

sub_str = r'\D+'

result = re.findall(sub_str, input_str)

print(result)


In [None]:
# Note here how '6ht' and '3r' are not returned separately from '0u6ht' and '3tt3r'. This is because the regex extends a
# match for as long as the pattern keeps matching, and only then starts searching for the next match from the following index. 

In [None]:
input_str = 'Rishabh Rajpoot 8141980914 Neha Agrawal 0918410984 Rinkey Pal 091490148'

In [None]:
sub_str = r'\w+\s\w+'

In [None]:
result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'\w+\d\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

In [None]:
#\s Returns a match where the string contains a whitespace character (space, tab, newline). \S is the opposite.

In [None]:
sub_str = r'\S+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'\w+\St\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
input_str = 'B3ttyb0u6ht some butt3rbut the butt3r w@s b!tt3r s0 bettyb0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

sub_str = r'\w+\Sb\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
#\w Returns a match where the pattern match contains any word characters - a to z, A to Z, 0 to 9 and _ (underscore)

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

sub_str = r'\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'\w+\W\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
#\Z - returns a match if the pattern is found at the end of the string(not each word)

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r'

sub_str = r'\w+t3r\Z'

result = re.findall(sub_str,input_str)

print(result)
#print(result.start())

#print(input_str[:116])

In [None]:
# \A - Match at start of string. Same as ^ (without the MULTILINE flag). r'\AB\w+', r'^B\w+'
# \Z - Match at end of string. Similar to $. r'\w+3r.\Z', r'\w+3r.$'
# \s - Match whitespace (space, tab, newline)
# \S - Non-whitespace character (anything that is not whitespace)
# \w - Word characters (a-z, A-Z, 0-9 and _)
# \W - Non-word characters (anything that is not a letter, digit or underscore)
# \d - Digits
# \D - Non-digits
# \b - Pattern should be at the beginning (or end) of a word
# \B - Pattern should be present BUT NOT at the beginning (or end) of a word.

In [None]:
# Regex will match character by character but performing greedy match.

In [None]:
input_str = 'a  xyzbut butbut butbutbut'

sub_str = r'b\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# Regex is used for extracting matches of a pattern from text. 

# Greedy match - Tries to match as much of the text as possible as part of its pattern. It won't stop as soon as a match
# is found but continues to add characters to the match as long as the conditions for the pattern are being met. 

# \A
# \b
# \B
# \d
# \D
# \s
# \S
# \w
# \W
# \Z


In [None]:
# There are also Metacharacters.

# MetaCharacters               Description
# \                            Drops the special meaning of the character following it
# []                           Represents a character class
# ^                            Matches the beginning of the string (like \A)
# $                            Matches the end of the string (like \Z)
# .                            Matches any character except newline
# |                            Means OR (matches any one of the patterns separated by it).
# ()                           Encloses a group of regex

# And Quantifiers

# ?                            Matches zero or one occurrence - It signifies an optional character.
# *                            Any number of occurrences (including 0 occurrences)
# +                            One or more occurrences
# {}                           Indicates the number of occurrences of the preceding regex to match.
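
Grouping with () and alternation with | are listed in the table but not demonstrated in this section, so here is a minimal sketch. The simplified e-mail pattern and the variable names are assumptions made for the example.

```python
import re

text = 'yamini@yahoo.com sankara@gmail.com'

# Parentheses capture parts of the match; | chooses between alternatives
pattern = r'(\w+)@(yahoo|gmail)\.com'

first = re.search(pattern, text)
print(first.group())    # the whole match
print(first.group(1))   # first group: the user name
print(first.group(2))   # second group: the provider

# When the pattern contains groups, findall returns tuples of the groups
pairs = re.findall(pattern, text)
print(pairs)
```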


In [None]:
# * = 0 or more occurrences of the character/sequence preceding it
# + = 1 or more occurrences of the character/sequence preceding it. 
# ? = 0 or 1 occurrences of the character/sequence preceding it. 

In [49]:
import re

input_str = 'He likes Red colour'

sub_str = r'colou?r'

result = re.findall(sub_str,input_str)

print(result)

['colour']


In [3]:
import re

In [4]:
input_str = 'yamini@yahoo.com sankara.N@yahoo.co.in'

In [5]:
sub_str = r'\w+[.]?\w+@\w+[.]?\w+[.]\w+'

result = re.findall(sub_str, input_str)

print(result)

['yamini@yahoo.com', 'sankara.N@yahoo.co.in']


In [None]:
# {} - specifies the number of occurrences: either a fixed count {n} or a range {m,n}
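
The three common forms of the {} quantifier can be sketched as follows. The sample words are made up, and \b is used on both sides so that only whole words are returned.

```python
import re

text = 'hot hoot hooot hoooot'

exactly_two = re.findall(r'\bho{2}t\b', text)      # exactly 2 o's
two_or_more = re.findall(r'\bho{2,}t\b', text)     # 2 or more o's
one_to_three = re.findall(r'\bho{1,3}t\b', text)   # between 1 and 3 o's

print(exactly_two)
print(two_or_more)
print(one_to_three)
```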



In [15]:
input_str = 'cot dot Foot got hot hoooooot Jot lot loot Moot not opt pot rot root soot tot toot booooot'

sub_str = r'[a-zA-Z]o{3,7}t'

result = re.findall(sub_str, input_str)

print(result)

['hoooooot', 'booooot']


In [None]:
# \A and ^ - match at the beginning of the string. 

# [^...] - inside square brackets, a leading ^ negates the character class instead

In [None]:
input_str = 'Betty b0u6ht some butt3r but \n the butt3r w@s b!tt3r s0 betty b0u6ht \t s0me b3tt3r butt3r \r to m@k3 the \r\n b!tt3r butt3r b3tt3r.'


print(input_str)

In [None]:
print(input_str)

In [76]:
# ^ = The same as \A - looks for pattern at beginning of string. 
# $ - The same as \Z - looks for pattern at end of string.


In [89]:
input_str = 'Betty b0u6ht some butt3r but \n the butt3r w@s b!tt3r s0 betty b0u6ht \t s0me b3tt3r butt3r \r to m@k3 the \r\n b!tt3r butt3r b3tt3r.'

sub_re = r'^B\w+'

result = re.findall(sub_re, input_str)

print(result)

['Betty']


In [90]:
input_str = 'Betty b0u6ht some butt3r but \n the butt3r w@s b!tt3r s0 betty b0u6ht \t s0me b3tt3r butt3r \r to m@k3 the \r\n b!tt3r butt3r b3tt3r'

sub_re = r'.+'

result = re.findall(sub_re, input_str)

print(result)

['Betty b0u6ht some butt3r but ', ' the butt3r w@s b!tt3r s0 betty b0u6ht \t s0me b3tt3r butt3r \r to m@k3 the \r', ' b!tt3r butt3r b3tt3r']


In [95]:
input_str = 'hat sat mat fat rat cat bat Oat pat Tat Wat 1at 7at'

sub_re = r'[^fhtxy]at'

result = re.findall(sub_re, input_str)

print(result)

['sat', 'mat', 'rat', 'cat', 'bat', 'Oat', 'pat', 'Tat', 'Wat', '1at', '7at']


In [None]:
[a-fA-F7-9]

In [96]:
sub_re = r'[a-fT-V1-5]at'

result = re.findall(sub_re, input_str)

print(result)

['fat', 'cat', 'bat', 'Tat', '1at']


In [100]:
input_str = 'I like basketball. He likes football. We dont like volleyball'

sub_re = r'b\w+t|f\w+t'

result = re.findall(sub_re, input_str)

print(result)

['basket', 'foot']


In [None]:
sub_str = r'.+'


result = re.findall(sub_str, input_str)

print(result)

In [4]:
input_str = 'colour humour color humor'

input_strAm = 'color humor'

sub_str = 'humou?r'

result = re.findall(sub_str, input_str)

print(result)

['humour', 'humor']


In [None]:
sub_str = r'\.+'

result = re.findall(sub_str, input_str)

print(result)

In [1]:
import re

In [3]:
input_str = 'boot hoot hot got soot moooot looooooot bought however shot  ,oooot'

sub_str = r'bo+t|ho+t'

result = re.findall(sub_str, input_str)
print(result)

['boot', 'hoot', 'hot', 'hot']


In [None]:
input_str = '897983  yashoda@gmail.com vishal@hotmail.com 7981301490'

sub_str = r'\d+|\w+@\w+\.com'

result = re.findall(sub_str, input_str)

print(result)

In [6]:
input_str = 'colour color colouuur humour humouur humor humouuur humouuuuuuur'

sub_str = 'humou{2,4}r'
    
result = re.findall(sub_str, input_str)

print(result)

['humouur', 'humouuur']


In [None]:
input_str = 'Ghosts say boo. Babies cry booohooohooobooooohoooohoooo. I wear a brown boot'

sub_str = r'bo{2,4}\w'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
input_str = 'Hello \n Hi \n Satsriakal \n namaste \n arigato \n hola'

print(input_str)

In [None]:
sub_str = r'\.+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# Metacharacters - Characters with special meaning in RegEx

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty B0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

In [None]:
# [] - Any 'set' of characters inside the square brackets.

In [None]:
# ^ - Outside the square brackets: match the pattern at the beginning of the string.
# ^ - Inside square brackets (as the first character): DO NOT MATCH the given characters.

In [None]:
sub_str = r'[^ Bb]\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
input_str = 'hello hill helium happy hover hall hulahoop'

sub_str = r'h[au]l\w*'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
import re

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht some b3tt3r butt3r to make the b!tt3r \
butt3r b3tt3r.'

sub_str = r'\w+t|\w+h'

result = re.findall(sub_str, input_str)

print(result)

# Matches a run of alphanumeric characters that ends in t or, failing that, in h. Because \w+ is greedy, the match runs up
# to the last t (or h) in the run and stops there; characters after it are not included.

In [None]:
import re

input_str = 'B3tty b0u6ht some cutter but the butt3r \n w@s b!tt3r s0 Betty b0u6ht \n s0me b3tter butt3r to m@k3 \n the bitt3r \
butt3r b3tter'

print(input_str)

In [None]:
sub_str = r'[^Bcb][ue3!]tt\w+'

result = re.findall(sub_str, input_str)

print(result)

# Matches one character that is NOT B, c or b, followed by one of u, e, 3 or !, then 'tt', then one or more alphanumeric
# characters.

In [None]:
print(input_str)

In [None]:
#^ = \A
#$ = \Z


sub_str = r'\w+r\Z'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'\w+er$'

result = re.finditer(sub_str, input_str)

for x in result:
    print(x.span(), x.group())

    
print(input_str[90:119])

In [None]:
# Note how in matches such as 'B3tt', 'butt', 'tt' and 'bett' the match did not stop as soon as the first t was found.
# That is because regex quantifiers are 'greedy': they match as much as they can. When the regex engine reaches the
# first t of b3tt, that character satisfies BOTH conditions: it is an alphanumeric character (\w) AND it is part of the
# set [et]. The engine therefore moves on to the next character, which also satisfies both conditions. Since the
# pattern does not allow any text after the final e or t, the match stops there. However, if we were to add more t's
# after the first 2, they would continue to get matched till the last t.

input_str = 'B3tty b0u6ht some butt3r but the buttttt3r w@s b!tt3r s0 betty b0u6ht some b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'b\w+'

result = re.finditer(sub_str, input_str)

print(result)

for x in result:
    print(x.span(), x.group())
    

# print(result.span())
# print(result.start())
# print(result.end())
# print(result.group())
# print(result.string)
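
The greedy behaviour described in the note above can be seen on a single word. This sketch uses a made-up sample and also shows the lazy form +? for contrast; lazy quantifiers are not covered in these notes and are included only as a hedged aside.

```python
import re

# Hypothetical sample word with five t's in a row.
sample = 'buttttt3r'

# Greedy: \w+ grabs as much as it can, then backs off to the LAST t.
greedy = re.search(r'b\w+t', sample).group()

# Lazy: \w+? takes as little as possible, so the match ends at the FIRST usable t.
lazy = re.search(r'b\w+?t', sample).group()

print(greedy)  # buttttt
print(lazy)    # but
```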

In [None]:
print(dir(result))

In [None]:
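lst = ['Nestle 4Q profit rises',
       'Amazon becomes king of online retail',
       'CEO of Nestle steps down',
       # .......
      ]

for x in lst:
    result = re.search('Nestle', x)
    # re.search returns None when a headline has no match, so check before using the match object
    if result:
        print(result.span(), result.string)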

In [None]:
# \ - Usually introduces a special sequence, but placed before a special character it escapes that character so that it is
# matched literally.

input_str = '''B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'''

print(input_str)

In [None]:
sub_str = r'\\'

result = re.search(sub_str, input_str)

print(result)

In [None]:
input_str = '''B3tty b0u6ht some butt3r but the butt3r w@s 'btt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'''

print(input_str)

In [None]:
sub_str = r'\'\w+'

result = re.search(sub_str, input_str)

print(result)
#print(input_str.index('\\'))

In [None]:
input_str = r"B3tty b0u6ht some butt3r but the bu\tt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r."

print(input_str)

In [None]:
sub_str = r'\\t\w+'
result = re.search(sub_str, input_str)

print(result.group())

In [None]:
input_str = r'B3tty b0u6ht some butt3r but the bu\tt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

print(input_str)

In [None]:
sub_str = r'\\'

result = re.search(sub_str, input_str)

print(result)
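
A small sketch of why the pattern for one literal backslash is written r'\\'. The path below is a hypothetical example, not taken from the lesson data.

```python
import re

# Hypothetical Windows-style path; the raw string keeps its backslashes as they are.
path = r'C:\temp\notes.txt'

# In a regex, a literal backslash must itself be escaped, hence r'\\'.
backslashes = re.findall(r'\\', path)

# Without the r prefix, the same pattern needs four backslashes in the source code.
same_pattern = '\\\\'

print(len(backslashes))       # 2
print(same_pattern == r'\\')  # True
```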

In [None]:
# Methods - 

# findall - matches all the patterns that are available in the input string and the output is in a list.
# finditer - matches all the patterns that are available in the input string. Output is an iterator object that contains all
# the found match objects.
# search - matches First occurrence of pattern in input string. Output is a match object
# match - matches pattern if it occurs at beginning of input string. Output is a match object.
# compile - compiles the pattern to search into a pattern object. This pattern object can be saved for reusability also for
# applying regex methods(methods available in re library) directly onto the pattern object.

# Methods available to match objects?
# ((start, end) index, matched pattern)

# span - shows the start and end index of the matched pattern in the form of a tuple. 
# start - shows the start index of the matched pattern. 
# end - outputs the end index of the matched pattern. 
# group - outputs the string which was matched by the pattern. 
# string (attribute) - outputs the input string on which the pattern matching was performed. 

# Special Sequences

# \A - searches for pattern at beginning of string
# \Z - searches for pattern at end of string. 
# \b - searches for pattern at a word boundary, i.e. at the beginning or the end of a word. 
# \B - searches for pattern but pattern must NOT be at a word boundary (it must be inside a word). 
# \d - searches for digit characters
# \D - searches for Non-Digit characters
# \s - searches for whitespace characters (space, tab, newline, etc.). 
# \S - searches for any character that is not whitespace.
# \w - searches for alphanumeric characters and the underscore
# \W - searches for characters that are NOT alphanumeric or underscore

# Metacharacters

# \ - As escape character
# [] - Sets of characters
# ^ - Outside square brackets - means search for pattern at beginning of string (same as \A unless re.M is used)
# $ - means search for pattern at end of string (like \Z, but $ also matches just before a trailing newline)
# . - Looks for any character except newline
# | - OR looks for one pattern OR another.

# Sets examples
# [apr] - will search for any one of the enclosed characters
# [^apr] - will NOT include the enclosed characters as part of search pattern.
# [0-5] - will include range of digits between 0 to 5
# [a-p] - will include range of alphabets between a to p
# [a-zA-Z] - Any alphabets between small a to small z and capital A to capital Z
# [1-5][3-7] - Any numbers that are between 1 to 5 as first digit and 3 to 7 as second digit
# [+*] - Inside square brackets most symbols lose their special meaning (the exceptions are ^ as the first character,
# - between two characters, ] and \). So + means we will look for the + character in the pattern, * means we will look
# for the * character in the pattern and so on....
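
A few of the special sequences summarised above, sketched on a made-up sample string. Note that \w also covers the underscore, and \s covers any whitespace.

```python
import re

# Hypothetical sample text.
sample = 'cat concat catalog 42 cats_7'

word_start = re.findall(r'\bcat\w*', sample)   # 'cat' right after a word boundary
inside_word = re.findall(r'\Bcat\w*', sample)  # 'cat' NOT at a word boundary, i.e. inside a word
digits = re.findall(r'\d+', sample)            # runs of digit characters
pieces = re.split(r'\s+', sample)              # split on runs of whitespace

print(word_start)   # ['cat', 'catalog', 'cats_7']
print(inside_word)  # ['cat']
print(digits)       # ['42', '7']
print(pieces)       # ['cat', 'concat', 'catalog', '42', 'cats_7']
```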

In [None]:
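# [1-5][3-7] matches:
# 1 followed by 3,4,5,6,7 - 13 to 17
# 2 followed by 3,4,5,6,7 - 23 to 27
# 3 followed by 3,4,5,6,7 - 33 to 37
# 4 followed by 3,4,5,6,7 - 43 to 47
# 5 followed by 3,4,5,6,7 - 53 to 57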

In [None]:
# . Signifies any character except the newline character \n. A carriage return \r IS matched by the dot.
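
A minimal sketch, on a made-up string, of what the dot does and does not match by default, and how the re.S flag changes that.

```python
import re

# Hypothetical sample with one newline (\n) and one carriage return (\r).
sample = 'one\ntwo\rthree'

default_dot = re.findall(r'.+', sample)              # the dot stops at \n but not at \r
dotall = re.findall(r'.+', sample, flags=re.S)       # with re.S the dot matches \n as well

print(default_dot)  # ['one', 'two\rthree']
print(dotall)       # ['one\ntwo\rthree']
```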

In [None]:
import re

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s \n b!tt3r s0 Betty b0u6ht \r\n s0me b3tt3r butt3r \n to m@k3 \
the b!tt3r \n butt3r b3tt3r.'

print(input_str)

In [None]:
sub_str = r'.+'

result = re.findall(sub_str,input_str)

print(result)

In [None]:
sub_str = r'\w+.'

result = re.findall(sub_str, input_str)
print(result)

In [None]:
# ^ Starts with specified character - same as \A

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'^B\w+'


result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'\AB\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'\bB\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'^b\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
input_str = 'b3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'^b.+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'\Ab\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# $ - Checks if whole string ends with specified characters. Same as \Z

In [None]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\w+t3r.$'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
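# Note: r't3r$' would give no result, since the string does not end with 't3r' but with 't3r.'. Here the unescaped . matches
# any character, including the final '.', so 't3r.' is found at the end of the string.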

sub_str = r't3r.$'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r't3r\.'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r't3r\.'

result = re.search(sub_str, input_str)

print(result)

In [None]:
# * - 0 or more occurrences of specified characters(placed on the right of the characters we wish to specify)

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!773r\
butt3r b3tt3r.'

input_str = 'fin find final finaal finale finals finish finash finl'
sub_str = r'fina*l'

result = re.findall(sub_str, input_str)

print(result)


In [None]:
# + - 1 or more occurrences of specified characters(placed on the right side of the characters we wish to specify)
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!773r\
butt3r b3tt3r.'

sub_str = r'\w+ht+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# {} - Exactly the specified number of occurrences. 
print(input_str)

In [None]:
input_str = 'B3tty b0u6ht some buttttttt3r buter the buttttt3r w@s bitt3r s0 Betty b0u6ht s0me b3tttt3r buttt3r to m@k3 the bi773r \
butt3r b3tt3r.'


In [None]:
sub_str = r'b[3uie]t{2,4}[3e]r'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# {} - Exactly the specified number of occurrences. 

sub_str = r'\w+t{2}'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# {} - Exactly the specified number of occurrences. 

sub_str = r'\w+t{2}\w+'

result = re.findall(sub_str, input_str)

print(result)


In [None]:
# {x,y} - Between the specified number of occurrences. 

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!t3r s0 Betty b0u6ht s0me b3tttttt3r butt3r to m@k3 the b!t3r\
butt3r b3tttt3r.'

sub_str = r'\w+[3ue]t{2,4}'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# | - Either / or: matches any one of the alternative patterns separated by |.

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'


sub_str = r'b\w+|s\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# ? - Makes the character preceding the ? mark optional.

text = 'The colonel colours the car in a Red color'

sub_str = r'colou?r'

result = re.findall(sub_str, text)

print(result)

In [None]:
# As you probably noticed, the regex query matched both 'colour' and 'color' since it was optional to match the u, if
# present it was matched, even if not present the pattern was matched.

In [None]:
# () Capture and group the specified characters pattern. Allows you to match (or capture) a specific group of characters
# collectively.

In [None]:
input_str = 'B3tty b0u6ht some btt3r but the btt3r w@s btt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'[bB]3?t\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
sub_str = r'(bar)'

result = re.search(sub_str, input_str)

print(result)

In [None]:
# The difference though - between regular regex without the parenthesis and with is that now the characters defined in the
# parentheses are treated as one group. e.g.

import re

input_str = 'foo barbar baz'

sub_str = r'\sbar'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
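# The pattern r'\sbar' has no quantifier, so it simply matches ' bar'. Had we written r'bar?', only the r would be optional;
# with r'(bar)?' the whole group 'bar' becomes optional.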
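
A hedged sketch of how parentheses change what a quantifier applies to, using the same 'foo barbar baz' string. re.search is used here so that the whole match is returned rather than just the captured group.

```python
import re

sample = 'foo barbar baz'

# Without parentheses, + repeats only the r.
no_group = re.search(r'bar+', sample).group()

# With parentheses, + repeats the whole group 'bar'.
with_group = re.search(r'(bar)+', sample).group()

# ? after a single character makes only that character optional.
optional_r = re.findall(r'bar?', 'ba bar')

print(no_group)    # bar
print(with_group)  # barbar
print(optional_r)  # ['ba', 'bar']
```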

In [None]:
# input_str = 'B3tty b0u6ht boon goon opt option soon moon some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
# butt3r b3tt3r.'

# sub_str = r'(opt(ion)?)'

# result = re.findall(sub_str, input_str)

# print(result)

In [36]:
input_str = 'B3tty b0u6ht s0me butter but the butter was bitt3r so B3tty b0u6ht some butter butter to mak3 the bitt3r \
butter b3tt3r but butt bute butr'

sub_str = r'(but(ter)?)'

result = re.findall(sub_str, input_str)

print(result)

[('butter', 'ter'), ('but', ''), ('butter', 'ter'), ('butter', 'ter'), ('butter', 'ter'), ('butter', 'ter'), ('but', ''), ('but', ''), ('but', ''), ('but', '')]


In [35]:
# findall returned a list of tuples, so index each tuple to get the outer group
for x in result:
    print(x[0])

butter
but
butter
butter
butter
butter
but
but
but
but


In [22]:
# We can use several groups (side by side or nested) to capture specific parts of a match.

input_str = 'B3tty b0u6ht s0me butter but the butter was bitt3r so B3tty b0u6ht some butter butter to mak3 the bitt3r \
butter b3tt3r but butt bute butr'

sub_str = r'(but)(ter)?'

result = re.finditer(sub_str, input_str)

print(result)

<callable_iterator object at 0x0000020F89C91400>


In [81]:
print(input_str)

B3tty b0u6ht s0me butter but the butter was bitt3r so B3tty b0u6ht some butter butter to mak3 the bitt3r butter b3tt3r but butt bute butr


In [105]:
sub_str2 = r'(but)(ter)?'
result = re.findall(sub_str, input_str)
result2 = re.finditer(sub_str2, input_str)

In [106]:
print(result)

[('but', 'ter'), ('but', ''), ('but', 'ter'), ('but', 'ter'), ('but', 'ter'), ('but', 'ter'), ('but', ''), ('but', ''), ('but', ''), ('but', '')]


In [107]:
print(result2)
#print(result.group())

<callable_iterator object at 0x0000025D9AC20D90>


In [108]:
for x in result2:
    print(x)
    print(x.groups())
    print(x.group(0))
    print(x.group(1))
    print(x.group(2))

<re.Match object; span=(18, 24), match='butter'>
('but', 'ter')
butter
but
ter
<re.Match object; span=(25, 28), match='but'>
('but', None)
but
but
None
<re.Match object; span=(33, 39), match='butter'>
('but', 'ter')
butter
but
ter
<re.Match object; span=(72, 78), match='butter'>
('but', 'ter')
butter
but
ter
<re.Match object; span=(79, 85), match='butter'>
('but', 'ter')
butter
but
ter
<re.Match object; span=(105, 111), match='butter'>
('but', 'ter')
butter
but
ter
<re.Match object; span=(119, 122), match='but'>
('but', None)
but
but
None
<re.Match object; span=(123, 126), match='but'>
('but', None)
but
but
None
<re.Match object; span=(128, 131), match='but'>
('but', None)
but
but
None
<re.Match object; span=(133, 136), match='but'>
('but', None)
but
but
None


In [None]:
# Note how the groups() method gave out a tuple with one entry per capture group. The pattern (but)(ter)? has two capture
# groups: group(0) is the whole match, group(1) is 'but' and group(2) is 'ter'.

# Not all capture groups have to participate in a match. A group that did not participate is reported as None by groups()
# and group(n), and as an empty string by findall. To get the breakdown of the groups captured by the regex in a match
# object, we can use the group or groups methods. 
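
A small sketch of how a non-participating group is reported, on a shortened made-up string. The default argument of groups() is part of the standard Match API; the variable names are illustrative.

```python
import re

sample = 'butter but'
pattern = r'(but)(ter)?'

# Match objects report a group that did not take part in the match as None.
match_groups = [m.groups() for m in re.finditer(pattern, sample)]

# findall reports the same group as an empty string.
findall_groups = re.findall(pattern, sample)

# groups(default=...) replaces None with a value of our choice.
with_default = [m.groups(default='') for m in re.finditer(pattern, sample)]

print(match_groups)    # [('but', 'ter'), ('but', None)]
print(findall_groups)  # [('but', 'ter'), ('but', '')]
print(with_default)    # [('but', 'ter'), ('but', '')]
```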

In [None]:
# Methods and attributes on match objects: span, start, end, group, groups, string (attribute)

In [None]:
input_str = 'I love basketball. But I am not very good at it.'
input_str2 = 'I also like badminton which I am actually pretty good at'

sub_str = r'I (also )?(love|like) (basketball|badminton)'

result = re.search(sub_str, input_str)

print(result)
print(result.group())
print(result.groups())

In [45]:
input_str = 'I love basketball. But I am not very good at it. I also like badminton which I am actually pretty good at. \
Rohit does not play either badminton or basketball'

sub_str = r'I (also )?(love|like) (basketball|badminton)'

result = re.finditer(sub_str, input_str)

print(result)

for x in result:
    print(x)
    print(x.group())
    print(x.groups())

<callable_iterator object at 0x0000020F89DD81C0>
<re.Match object; span=(0, 17), match='I love basketball'>
I love basketball
(None, 'love', 'basketball')
<re.Match object; span=(49, 70), match='I also like badminton'>
I also like badminton
('also ', 'like', 'badminton')


In [None]:
result = re.search(sub_str, input_str2)

print(result)
print(result.group())
print(result.groups())

In [3]:
import re
inputstr = 'Michael Jordan email mj@gmail.com  Kobe Bryant email kobe@hotmail.com LeBron James email lbj@rediff.com'

pattern = r'(\w+) (\w+) email (\w+@\w+[.]com)'

result = re.finditer(pattern, inputstr)

print(result)

for player in result:
    print(f'Name : {player.group(1)}')
    print(f'Last Name : {player.group(2)}')
    print(f'Email : {player.group(3)}')
          

<callable_iterator object at 0x000001686F1BEFD0>
Name : Michael
Last Name : Jordan
Email : mj@gmail.com
Name : Kobe
Last Name : Bryant
Email : kobe@hotmail.com
Name : LeBron
Last Name : James
Email : lbj@rediff.com


In [None]:
# Sets - A set in Regex is a set of characters inside a pair of square brackets [] with a special meaning. 

# Set        Description
# [apz]      Returns a match where any one of the specified characters (a, p, or z) are present
# [a-e]      Returns a match for any lower case character, alphabetically between a and e
# [^apz]     Returns a match for any character EXCEPT a, p, and z
# [0279]     Returns a match where any of the specified digits (0, 2, 7 or 9) are present
# [0-5]      Returns a match for any digit between 0 and 5
# [0-5][0-9] Returns a match for any two-digit number from 00 to 59
# [2-5][4-7] Returns a match of numbers from 24 to 27, 34 to 37, 44 to 47 and 54 to 57
# [a-zA-Z]   Returns a match for any character alphabetically between a and z, lower case OR upper case
# [+]        In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character
#            in the string, [*] for any * character and so on. Only ^ (as the first character), - (between two
#            characters), ] and \ keep a special meaning inside a set.


# . - Anything except newline
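
A minimal sketch of how symbols behave inside a set, using a made-up sample string. It illustrates that +, * and . are literal inside square brackets, and that ^ only negates when it comes first.

```python
import re

# Hypothetical sample: 'a', one symbol or letter, then 'b'.
sample = 'a+b a*b a.b a^b a-b axb'

literal_symbols = re.findall(r'a[+*.]b', sample)  # +, * and . are plain characters here
caret_not_first = re.findall(r'a[+^]b', sample)   # ^ is literal when it is not the first character
negated = re.findall(r'a[^+*.]b', sample)         # ^ first: anything EXCEPT +, * and .

print(literal_symbols)  # ['a+b', 'a*b', 'a.b']
print(caret_not_first)  # ['a+b', 'a^b']
print(negated)          # ['a^b', 'a-b', 'axb']
```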



In [None]:
0  - 0 1 2 3 4 5 6 7 8 9 = 00 to 09
1  - 0 1 2 3 4 5 6 7 8 9 = 10 to 19
2  - 0 1 2 3 4 5 6 7 8 9
3  - 0 1 2 3 4 5 6 7 8 9
4  - 0 1 2 3 4 5 6 7 8 9
5  - 0 1 2 3 4 5 6 7 8 9 = 50 to 59




In [None]:
input_str = '0279ABC'
input_str2 = '0ABC'
input_str3 = '9ABC'

sub_str = r'[0279]\w+'

In [None]:
# ^ - Search at the beginning of string (same as \A, with a slight difference explained later). This meaning of the caret
# is ONLY APPLICABLE outside square brackets (character class). INSIDE square brackets, as the first character, it means
# "except".
# $ - Search at END of string (same as \Z, with a slight difference explained later)


In [None]:
0 [0 1 2 3 4 5 6 7 8 9] - 00 ~ 09
1 [0 1 2 3 4 5 6 7 8 9] - 10 ~ 19
2 [0 1 2 3 4 5 6 7 8 9] - 20 ~ 29
3 [0 1 2 3 4 5 6 7 8 9] - 30 ~ 39
4 [0 1 2 3 4 5 6 7 8 9] - 40 ~ 49
5 [0 1 2 3 4 5 6 7 8 9] - 50 ~ 59

In [None]:
24,25,26,27
34,35,36,37
44,45,46,47
54,55,56,57

In [None]:
inputstr = 'abpzxyz'

substr = r'[^apz]\w+'

result = re.findall(substr, inputstr)

print(result)

In [None]:
# [apz] - Square brackets around specified characters - Returns a match where any one of the specified characters
# (a, p, or z) are present


input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to make the bitter \
butter better'

sub_str = r'\w+[mh]'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# [a-d] - Returns a match for any lower case character, alphabetically between a and d

input_str = 'Betty bought some bubtter but the butter was bitter so Betty bocught some bedtter butter to make the bitter \
butter better'

sub_str = r'\w+[a-d]'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
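# [^apz] - Returns a match for any character EXCEPT a, p, and z. In the pattern below the caret is escaped and not the
# first character, so [\^3] is a plain set that matches either a literal ^ or 3.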

input_str = 'betty b0u6ht s0me butter but the but^3r was bitter so Betty bou6ht s0m3 better butt3r to mak3 th3 bitt3r \
butt3r b3tt3r'

sub_str = r'b\w+t[\^3]\w+'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# [03]       Returns a match where any of the specified digits (0 or 3) are present

input_str = 'B3tty b0u6ht s0me butter but the butt3r was bitter so Betty bou6ht s0m3 better butt3r to mak3 th3 bitt3r \
butt3r b3tt3r'

sub_str = r'\w+[03]'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# [0-9]      Returns a match for any digit between 0 and 9

input_str = 'B3tty b0u64t s0me butter but 43 t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r'

sub_str = r'\w+[0-9]'

result = re.findall(sub_str, input_str)
print(result)

In [None]:
# compile method

In [26]:
input_str = 'B3tty b0u64t s0me butter but 43 t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r'

In [27]:
sub_str = r'b\w+'

In [28]:
result = re.findall(sub_str, input_str)

print(result)

['b0u64t', 'butter', 'but', 'butt3r', 'bitt3r', 'bou64t', 'b3tt3r', 'butt3r', 'bitt3r', 'butt3r', 'b3tt3r']


In [33]:
x = str(10)

print(type(x))

<class 'str'>


In [29]:
sub_str2 = re.compile(r'b\w+')

print(sub_str2)
print(type(sub_str2))

re.compile('b\\w+')
<class 're.Pattern'>


In [31]:
result = re.findall(re.compile(sub_str), input_str)

print(result)

['b0u64t', 'butter', 'but', 'butt3r', 'bitt3r', 'bou64t', 'b3tt3r', 'butt3r', 'bitt3r', 'butt3r', 'b3tt3r']


In [35]:
result = re.findall(sub_str, input_str)
result2 = sub_str2.findall(input_str)

print(result)
print(result2)



['b0u64t', 'butter', 'but', 'butt3r', 'bitt3r', 'bou64t', 'b3tt3r', 'butt3r', 'bitt3r', 'butt3r', 'b3tt3r']
['b0u64t', 'butter', 'but', 'butt3r', 'bitt3r', 'bou64t', 'b3tt3r', 'butt3r', 'bitt3r', 'butt3r', 'b3tt3r']


In [None]:
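# How the pieces fit together:
#
# re module  - functions take the pattern and the input string, e.g. re.findall(pattern, input_str)
# re.Pattern - object returned by re.compile; the same functions are available as methods that take only the input
#              string, e.g. pattern.findall(input_str)
# re.Match   - object returned by search, match and finditer; offers span, start, end, group, groups and string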

In [None]:
# search method

In [38]:

print(sub_str)
print(input_str)

b\w+
B3tty b0u64t s0me butter but 43 t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r


In [36]:
result = re.search(sub_str, input_str)

print(result)

<re.Match object; span=(6, 12), match='b0u64t'>


In [39]:
print(result.span())

(6, 12)


In [40]:
print(result.start())

6


In [41]:
print(result.end())

12


In [42]:
print(result.group())

b0u64t


In [43]:
print(result.string)

B3tty b0u64t s0me butter but 43 t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r


In [56]:
# match 

sub_str = r'b\w+'

result = re.match(sub_str, input_str)

print(result)

None


In [57]:
# finditer

sub_str2 = re.compile(r'b\w+')
result = re.finditer(sub_str,input_str)
result2 = sub_str2.finditer(input_str)

print(result)

<callable_iterator object at 0x0000025D9AC20E80>


In [59]:
for x in result2:
    print(x)
    print(x.span())
    print(x.group())

<re.Match object; span=(6, 12), match='b0u64t'>
(6, 12)
b0u64t
<re.Match object; span=(18, 24), match='butter'>
(18, 24)
butter
<re.Match object; span=(25, 28), match='but'>
(25, 28)
but
<re.Match object; span=(36, 42), match='butt3r'>
(36, 42)
butt3r
<re.Match object; span=(47, 53), match='bitt3r'>
(47, 53)
bitt3r
<re.Match object; span=(63, 69), match='bou64t'>
(63, 69)
bou64t
<re.Match object; span=(75, 81), match='b3tt3r'>
(75, 81)
b3tt3r
<re.Match object; span=(82, 88), match='butt3r'>
(82, 88)
butt3r
<re.Match object; span=(101, 107), match='bitt3r'>
(101, 107)
bitt3r
<re.Match object; span=(108, 114), match='butt3r'>
(108, 114)
butt3r
<re.Match object; span=(115, 121), match='b3tt3r'>
(115, 121)
b3tt3r


In [None]:

for y in result:
    print(y.group(), y.span())

In [None]:
sub_str = r'(s0me)|(t4e)'

result = re.finditer(sub_str,input_str)

print(result)

for x in result:
    print(x.group(), x.span())

In [None]:
# [2-6][3-5] Returns a match for a digit from 2 to 6 followed by a digit from 3 to 5 (just as [0-5][0-9] matches 00 to 59)

input_str = 'B3tty b0u64t s0me butter but 43 t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r'

sub_str = r'\w+[2-6][3-5]'

result = re.findall(sub_str, input_str)

print( result)

In [None]:
sub_str = r'\w+[2-6][4-5]'

result = re.findall(sub_str, input_str)

print( result)

In [None]:
# [a-zA-Z]   Returns a match for any character alphabetically between a and z, lower case OR upper case

input_str = 'B3tty b0ught s0m3 b~tt3r'

sub_str = r'[a-zA-Z]'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
# [+]        In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character
#            in the string, [*] for any * character and so on.

In [None]:
import re

input_str = 'B3++y b*ugh+ s*m3 b~tt3r'

sub_str = r'\w+[*~+]'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
result 

In [None]:
# Flags - Most regex methods allow a third parameter called flags. The most common flags used are: 

In [None]:
# Short Name          Long Name         Effect
# re.I                re.IGNORECASE     Makes matching of alphabetic characters case-insensitive
# re.M                re.MULTILINE      Causes ^ and $ to also match at the start and end of each line (at embedded newlines)
# re.S                re.DOTALL         Causes the dot metacharacter to match a newline

In [53]:
input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to make the bitter butter better'

sub_str = r'b\w+'

result = re.findall(sub_str, input_str, flags = re.I)

print(result)

['Betty', 'bought', 'butter', 'but', 'butter', 'bitter', 'Betty', 'bought', 'better', 'butter', 'bitter', 'butter', 'better']


In [None]:
# . - Any character except newline. When we use re.S, the dot also accepts newline as a match.

# [.] - Only a literal dot

In [55]:
input_str = '''Alikya entered a cooking competion. She placed 2nd
Yashoda went to the library. She checked out "The Forest of enchantement" from there
Vishal plays chess. He hopes to one day be a grandmaster'''

print(input_str)

Alikya entered a cooking competion. She placed 2nd
Yashoda went to the library. She checked out "The Forest of enchantement" from there
Vishal plays chess. He hopes to one day be a grandmaster


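# ^ - beginning of string, like \A. The slight difference: with re.M, ^ also matches at the beginning of each line, while
#     \A never does.
# $ - end of string, like \Z. The slight difference: $ also matches just before a trailing newline (and, with re.M, at the
#     end of each line), while \Z matches only at the very end.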
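
The slight difference between ^ and \A, and between $ and \Z, can be sketched as follows. The two-line sample string is made up and ends with a newline on purpose.

```python
import re

# Hypothetical two-line text that ends with a newline.
sample = 'first line\nsecond line\n'

caret_default = re.findall(r'^\w+', sample)                  # start of string only
caret_multiline = re.findall(r'^\w+', sample, flags=re.M)    # start of every line
anchor_a = re.findall(r'\A\w+', sample, flags=re.M)          # \A ignores re.M

dollar = re.findall(r'line$', sample)     # $ also matches just before the trailing newline
anchor_z = re.findall(r'line\Z', sample)  # \Z matches only at the very end of the string

print(caret_default)    # ['first']
print(caret_multiline)  # ['first', 'second']
print(anchor_a)         # ['first']
print(dollar)           # ['line']
print(anchor_z)         # []
```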


In [60]:
sub_str = r'^[A-Z]\w+'

result = re.findall(sub_str, input_str, flags = re.M)

print(result)

['Alikya', 'Yashoda', 'Vishal']


In [70]:
# Ignore case

input_str = 'B3tty b0u64t s0me butter but the butt3r was bitt3r so B3tty bought s0m3 b3tt3r butt3r to mak3 the bitt3r \n\
butt3r b3tt3r'

sub_str = r'b\w+'

result = re.findall(sub_str, input_str, flags = re.I)

print(result)

['B3tty', 'b0u64t', 'butter', 'but', 'butt3r', 'bitt3r', 'B3tty', 'bought', 'b3tt3r', 'butt3r', 'bitt3r', 'butt3r', 'b3tt3r']


In [None]:
# Multiline

In [61]:
input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r\nBetty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r\nBitty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r'


print(input_str)

B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r
Betty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r
Bitty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r


In [None]:
sub_str = r'^B\w+y'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
result = re.findall(sub_str, input_str, flags = re.M)

print(result)

In [62]:
#Dotall includes the \n characters in the . search character set. 

input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r 3rxB3 so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r\nB3tty b0u64t s0me butter but t4e butt3r was bitt3r so 3rxB3 B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r\nB3tty b0u64t s0me butter but t4e butt3r was bitt3r so 3rxB3 B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r'

print(input_str)

B3tty b0u64t s0me butter but t4e butt3r was bitt3r 3rxB3 so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r
B3tty b0u64t s0me butter but t4e butt3r was bitt3r so 3rxB3 B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r
B3tty b0u64t s0me butter but t4e butt3r was bitt3r so 3rxB3 B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r


In [66]:
sub_str = r'3r.B\w'

result = re.findall(sub_str, input_str, flags = re.S|re.I)

print(result)

['3rxB3', '3r bu', '3r bu', '3r b3', '3r\nB3', '3rxB3', '3r bu', '3r bu', '3r b3', '3r\nB3', '3rxB3', '3r bu', '3r bu', '3r b3']


In [None]:

input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \n\
butt3r b3tt3r'

sub_str = r'b.*r'

result = re.findall(sub_str, input_str)

print(result)

In [None]:
input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \n\
butt3r b3tt3r'

sub_str = r'b\w+'

result = re.finditer(sub_str, input_str)

print(result)


for x in result:
    print(x)

In [None]:
# Other methods in re module

# We have already seen search (along with span, start, end, string, group and groups methods on match objects) method.
# We have also seen findall method. 

# Some other methods in regex are:

In [None]:
# compile method - compiles the stated regex into a regex object that can be reused and methods can be applied to it
# directly. 

input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \n\
butt3r b3tt3r'


sub_str = re.compile(r'\bb\w+')
sub_str1 = r'\bb\w+'

print(sub_str)
print(type(sub_str))

result = re.findall(sub_str, input_str)

print(result)

In [None]:
print(sub_str.findall(input_str))

In [None]:
input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \n\
butt3r b3tt3r'

sub_str = r'\bB\w+'

result = re.match(sub_str, input_str)

print(result)

In [None]:
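# re.match returns a single match object (or None), not an iterator, so read it directly instead of looping over it
if result:
    print(result.group(), result.span())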

In [None]:
print(result.string)

In [None]:
result

In [None]:
ss = r'\w+[+*~]'

result = re.findall(ss, input_str)

In [None]:
ss = re.compile(r'\w+[+*~]')

result = ss.findall(input_str)

print(result)
print(type(ss))

In [None]:
# Note that when we used the compile function above, the regex was converted to a re.Pattern object, and we could then
# call methods directly on the ss object. If we have many operations to perform with the same regex pattern, or the
# pattern is used frequently, it may be better to save it as a re.Pattern object.

# Internally, the same compilation happens whenever we pass a pattern string to a module-level function such as
# re.findall. The call below spells out that step explicitly:

sub_str = r'\w+[+*~]'

result = re.findall(re.compile(sub_str), input_str)

print(result)

In [None]:
# In this module we changed the regex frequently, so we did not compile the patterns. However, complex patterns that
# are used repeatedly are usually compiled once and stored for reuse.
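
As a rough illustration of that reuse, the sketch below compiles one pattern (with a flag baked in) and applies it to several strings. The names `starts_with_b` and `lines` are made up for this example.

```python
import re

# Compile once; the IGNORECASE flag becomes part of the pattern object
starts_with_b = re.compile(r'\bb\w+', flags=re.I)

# Hypothetical inputs that all need the same pattern
lines = ['Betty bought butter', 'the butter was bitter']

# The same pattern object is reused for every line
results = [starts_with_b.findall(line) for line in lines]
print(results)  # [['Betty', 'bought', 'butter'], ['butter', 'bitter']]

# Other methods are available on the same object
first = starts_with_b.search(lines[1])
print(first.group(), first.span())  # butter (4, 10)
```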

In [None]:
# match function - returns a match object only if the pattern matches at the beginning of the string, otherwise None

In [None]:
input_str = 'Xyz abc xyz abc'

sub_str = r'xyz'

result = re.match(sub_str, input_str, flags = re.I)

print(result)

In [None]:
sub_str = r'xyz'

result = re.match(sub_str, input_str)

print(result)
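
To see how `match` differs from its neighbours, the sketch below runs `match`, `search` and `fullmatch` on the same string. It is an illustration only; `fullmatch` is a further function of the `re` module that has not been used in this session.

```python
import re

text = 'Xyz abc xyz abc'

at_start = re.match(r'abc', text)    # 'abc' is not at the beginning -> None
anywhere = re.search(r'abc', text)   # first 'abc' anywhere in the string
start_ci = re.match(r'xyz', text, flags=re.I)      # 'Xyz' matches once case is ignored
whole_ci = re.fullmatch(r'xyz', text, flags=re.I)  # must cover the whole string -> None

print(at_start)           # None
print(anywhere.span())    # (4, 7)
print(start_ci.group())   # Xyz
print(whole_ci)           # None
```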

In [72]:
input_str = 'Rajasekhar scored 100, Rohit bowled 5/20, Srinivas fielded superbly, Utkarsh danced energetically, \
Lovish dropped a catch'

result = input_str.split(',')

print(result)

['Rajasekhar scored 100', ' Rohit bowled 5/20', ' Srinivas fielded superbly', ' Utkarsh danced energetically', ' Lovish dropped a catch']


In [76]:
sub_str = r'\w+ed\s'

#result = re.split(',', input_str)
result = input_str.split(',')
print(result)

['Rajasekhar scored 100', ' Rohit bowled 5/20', ' Srinivas fielded superbly', ' Utkarsh danced energetically', ' Lovish dropped a catch']


In [77]:
lst1 = []
for x in result:
    lst1.append(re.split(sub_str, x))
    
print(lst1)

[['Rajasekhar ', '100'], [' Rohit ', '5/20'], [' Srinivas ', 'superbly'], [' Utkarsh ', 'energetically'], [' Lovish ', 'a catch']]
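
The leading spaces in the pieces above come from splitting on a bare comma. One possible variation, sketched here on a shortened copy of the sentence, lets `re.split` absorb the whitespace after each comma and the spaces around the verb:

```python
import re

# Shortened copy of the sentence used above
text = 'Rajasekhar scored 100, Rohit bowled 5/20, Srinivas fielded superbly'

# Split on a comma plus any whitespace that follows it
parts = re.split(r',\s*', text)
print(parts)  # ['Rajasekhar scored 100', 'Rohit bowled 5/20', 'Srinivas fielded superbly']

# Split each part on the "...ed" verb, including the spaces on both sides
pairs = [re.split(r'\s\w+ed\s', part) for part in parts]
print(pairs)  # [['Rajasekhar', '100'], ['Rohit', '5/20'], ['Srinivas', 'superbly']]
```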


In [109]:
# The split function in re takes 4 parameters

# re.split(pattern, string, maxsplit=0, flags=0)

# 1. pattern - the regex pattern to split on - mandatory
# 2. string - the string to be split - mandatory
# 3. maxsplit - the maximum number of splits to perform; 0 (the default) means no limit - optional
# 4. flags - as discussed above - optional

input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butter to mak3 t43 bitt3r \n\
butt3r b3tt3r'
sub_str = r'\d[a-zA-Z]'

result = re.split(sub_str, input_str)

print(result)

['B', 'ty b', '6', ' s', 'e butter but t', ' butt', ' was bitt', ' so B', 'ty bou6', ' s', '3 b', 't', ' butter to mak3 t43 bitt', ' \nbutt', ' b', 't', '']


In [110]:
result = re.split(sub_str, input_str, maxsplit = 5)

print(result)

['B', 'ty b', '6', ' s', 'e butter but t', ' butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butter to mak3 t43 bitt3r \nbutt3r b3tt3r']
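
The same behaviour is easier to follow on a tiny string. The sketch below (the string `sample` is made up) also shows one extra documented behaviour of `re.split`: if the pattern contains a capturing group, the delimiters are kept in the result.

```python
import re

# Hypothetical sample with single-digit delimiters
sample = 'a1b2c3d'

all_splits = re.split(r'\d', sample)             # split at every digit
limited = re.split(r'\d', sample, maxsplit=2)    # stop after two splits
with_delims = re.split(r'(\d)', sample)          # capturing group keeps the digits

print(all_splits)   # ['a', 'b', 'c', 'd']
print(limited)      # ['a', 'b', 'c3d']
print(with_delims)  # ['a', '1', 'b', '2', 'c', '3', 'd']
```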


In [None]:
# re.I is the short form of re.IGNORECASE
re.IGNORECASE, re.I

In [None]:
input_str = 'B3tty b0u64t S0me butter but t4e Butter was bitter so Betty bought some Better butter to make the Bitter butter better'
sub_str = r'b[uei]tter'

#result = re.findall(sub_str, input_str, flags = re.I)


result = re.split(sub_str, input_str, flags = re.I)

print(result)
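
A shorter, made-up sentence makes the effect of the flag on `re.split` easy to check by eye:

```python
import re

# Hypothetical sample with the delimiter word in three different casings
text = 'some Butter, more bitter, best BETTER!'

case_sensitive = re.split(r'b[uei]tter', text)
case_insensitive = re.split(r'b[uei]tter', text, flags=re.I)

print(case_sensitive)    # ['some Butter, more ', ', best BETTER!']
print(case_insensitive)  # ['some ', ', more ', ', best ', '!']
```

Passing `flags` by keyword matters here, because the third positional parameter of `re.split` is `maxsplit`.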

In [81]:
# The sub function takes 5 parameters

# 1. The regex pattern to be matched - mandatory
# 2. The replacement string - mandatory
# 3. The string to be searched - mandatory
# 4. count - the maximum number of replacements to perform; 0 (the default) replaces all matches - optional
# 5. flags - optional

input_str = 'ayz abc xyz Abc xyz aBc'

sub_str = r'a\w+'
repl_str = r'PQR'

result = re.sub(sub_str, repl_str, input_str, count = 2)
print(result)

PQR PQR xyz Abc xyz aBc


In [None]:
repl = r'pqrst'

print(input_str)

In [None]:
result = re.sub(sub_str, repl, input_str, flags = re.I, count = 2)

print(result)

In [None]:
result = re.sub(sub_str, repl, input_str, flags = re.I)

print(result)
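
For reference, the sketch below puts the three variations of `re.sub` side by side on the first sample string from this part, with the results written out:

```python
import re

text = 'ayz abc xyz Abc xyz aBc'

replace_all = re.sub(r'a\w+', 'PQR', text)              # every lowercase-a word
first_only = re.sub(r'a\w+', 'PQR', text, count=1)      # at most one replacement
ignore_case = re.sub(r'a\w+', 'PQR', text, flags=re.I)  # 'Abc' now matches as well

print(replace_all)  # PQR PQR xyz Abc xyz PQR
print(first_only)   # PQR abc xyz Abc xyz aBc
print(ignore_case)  # PQR PQR xyz PQR xyz PQR
```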

In [None]:
input_str = 'Xyz Abc xyz abc xyz aBc'

sub_str = r'abc'

repl = r'pqr'

print(input_str)

In [None]:
result = re.sub(sub_str, repl, input_str, flags = re.I, count = 2)

print(result)

In [None]:
print(input_str)
print(sub_str)

In [82]:
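input_str = 'ayz abc xyz Abc xyz aBc'  # restore the earlier example values
sub_str = r'a\w+'
repl_str = r'PQR'
print(input_str)
print(sub_str)
print(repl_str)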

ayz abc xyz Abc xyz aBc
a\w+
PQR


In [85]:
# The subn function works like sub, except that it returns a tuple of the new string and the number of replacements
# actually made

result = re.subn(sub_str, repl_str, input_str, count = 4)

print(result)

('PQR PQR xyz Abc xyz PQR', 3)
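
Because `subn` returns a tuple, it is usually unpacked into two names. A short sketch (the variable names are chosen for illustration):

```python
import re

text = 'ayz abc xyz Abc xyz aBc'

# count=4 is only an upper limit; three words match, so three replacements are made
new_text, n = re.subn(r'a\w+', 'PQR', text, count=4)
print(new_text, n)      # PQR PQR xyz Abc xyz PQR 3

# Ignoring case adds 'Abc' to the matches
new_text_ci, n_ci = re.subn(r'a\w+', 'PQR', text, flags=re.I)
print(new_text_ci, n_ci)  # PQR PQR xyz PQR xyz PQR 4

# With no match the string comes back unchanged and the count is 0
unchanged, n_none = re.subn(r'q\w+', 'PQR', text)
print(unchanged == text, n_none)  # True 0
```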


In [None]:
result = re.subn(sub_str, repl_str, input_str, flags = re.I)

print(result)

In [None]:
input_str = 'abc xyz XYZ lmn'
repl_str = 'PQR'
sub_str = r'a\w+'

In [None]:
result = re.subn(sub_str, repl_str, input_str, count = 2)

print(result)

In [None]:
str_email = '''boleh di kirim ke email saya ekoprasetyo.crb@outlook.com tks...
boleh minta kirim ke db.maulana@gmail.com. 
dee.wien@yahoo.com. .
deninainggolan@yahoo.co.id Senior Quantity Surveyor
Fajar.rohita@hotmail.com, terimakasih bu Cindy Hartanto
firmansyah1404@gmail.com saya mau dong bu cindy
fransiscajw@gmail.com 
Hi Cindy ...pls share the Salary guide to donny_tri_wardono@yahoo.co.id thank a'''

In [None]:
#Ansh

sub_string = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'
result = re.findall(sub_string, str_email)

print(result)


In [None]:
#Chandrabose

substr=r'\w+[.]?\w+@\w+.co(.)?\w+'
result=re.finditer(substr,str_email)

for x in result:
    print(x.group())



In [None]:
#Grishma

sub_str= r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
result= re.findall(sub_str , str_email)
print(result)


In [None]:
#mithila

#sub_str = r'\b\w+[.]?\w+@\w+.\w+\b'
sub_str = r'\b\w+[.]?\w+@'

result= re.findall(sub_str , str_email)
print(result)



In [None]:
#Sachin

sub_str = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
result = re.findall(sub_str, str_email)
print(result)

In [None]:
# Find all the email addresses in the following text (no extra spaces or characters allowed). Include all the different
# domain names, such as gmail.com, yahoo.co.id, etc.


str_email = '''boleh di kirim ke email saya ekoprasetyo.crb@outlook.com tks...
boleh minta kirim ke db.maulana@gmail.com. 
dee.wien@yahoo.com. .
deninainggolan@yahoo.co.id Senior Quantity Surveyor
Fajar.rohita@hotmail.com, terimakasih bu Cindy Hartanto
firmansyah1404@gmail.com saya mau dong bu cindy
fransiscajw@gmail.com 
Hi Cindy ...pls share the Salary guide to donny_tri_wardono@yahoo.co.id thank a'''
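
One way to compare the different attempts is to check each pattern against the list of addresses we expect. The sketch below is only an illustration: the helper `check`, the list `expected` and the pattern `candidate` are made up here and are not the reference answer for the exercise.

```python
import re

str_email = '''boleh di kirim ke email saya ekoprasetyo.crb@outlook.com tks...
boleh minta kirim ke db.maulana@gmail.com. 
dee.wien@yahoo.com. .
deninainggolan@yahoo.co.id Senior Quantity Surveyor
Fajar.rohita@hotmail.com, terimakasih bu Cindy Hartanto
firmansyah1404@gmail.com saya mau dong bu cindy
fransiscajw@gmail.com 
Hi Cindy ...pls share the Salary guide to donny_tri_wardono@yahoo.co.id thank a'''

# The eight addresses in the text, read off by hand
expected = ['ekoprasetyo.crb@outlook.com', 'db.maulana@gmail.com', 'dee.wien@yahoo.com',
            'deninainggolan@yahoo.co.id', 'Fajar.rohita@hotmail.com', 'firmansyah1404@gmail.com',
            'fransiscajw@gmail.com', 'donny_tri_wardono@yahoo.co.id']

def check(pattern, flags=0):
    """Hypothetical helper: return (missing, extra) relative to the expected addresses."""
    # finditer + group() gives the whole match even if the pattern has capturing groups
    found = [m.group() for m in re.finditer(pattern, str_email, flags=flags)]
    missing = [e for e in expected if e not in found]
    extra = [f for f in found if f not in expected]
    return missing, extra

# One possible pattern: local part, '@', then a name followed by one or more '.label' pieces
candidate = r'[\w.+-]+@\w+(?:\.\w+)+'
print(check(candidate))    # ([], [])

# A looser attempt keeps the trailing punctuation, so it shows up under both headings
print(check(r'\S+@\S+'))
```

An empty `missing` and an empty `extra` list mean the pattern found exactly the expected addresses on this text; it says nothing about email addresses in general.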

In [None]:
# Find all the phone numbers. No extra spaces or characters allowed. 

str_phone = '''<p><strong>Kuala Lumpur</strong><strong>:</strong> +60 (0)3 2723 7900</p>
        <p><strong>Mutiara Damansara:</strong> +60 (0)3 2723 7900</p>
        <p><strong>Penang:</strong> + 60 (0)4 255 9000</p>
        <h2>Where we are </h2>
        <strong>&nbsp;Call us on:</strong>&nbsp;+6 (03) 8924 8686
        </p></div><div class="sys_two">
    <h3 class="parentSchool">General enquiries</h3><p style="FONT-SIZE: 11px">
     <strong>&nbsp;Call us on:</strong>&nbsp;+6 (03) 8924 8000
+ 60 (7) 268-6200 <br />
 Fax:<br /> 
 +60 (7) 228-6202<br /> 
Phone:</strong><strong style="color: #f00"> +601-4228-8055</strong>'''
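
A possible starting point for this exercise, offered as a sketch rather than the reference answer: every number in the text begins with `+`, optionally followed by one space, and then consists only of digits, spaces, parentheses and hyphens, ending in a digit. The pattern name `phone_patt` is made up for this example.

```python
import re

str_phone = '''<p><strong>Kuala Lumpur</strong><strong>:</strong> +60 (0)3 2723 7900</p>
        <p><strong>Mutiara Damansara:</strong> +60 (0)3 2723 7900</p>
        <p><strong>Penang:</strong> + 60 (0)4 255 9000</p>
        <h2>Where we are </h2>
        <strong>&nbsp;Call us on:</strong>&nbsp;+6 (03) 8924 8686
        </p></div><div class="sys_two">
    <h3 class="parentSchool">General enquiries</h3><p style="FONT-SIZE: 11px">
     <strong>&nbsp;Call us on:</strong>&nbsp;+6 (03) 8924 8000
+ 60 (7) 268-6200 <br />
 Fax:<br /> 
 +60 (7) 228-6202<br /> 
Phone:</strong><strong style="color: #f00"> +601-4228-8055</strong>'''

# '+', optional space, a digit, then any run of digits/spaces/brackets/hyphens that ends in a digit
phone_patt = r'\+ ?\d[\d ()-]*\d'

phones = re.findall(phone_patt, str_phone)
for p in phones:
    print(p)
```

The final `\d` is what trims trailing spaces: the character class may run past the number, but the match has to end on a digit. The pattern relies on the numbers in this text starting with `+`, which is an assumption that will not hold for phone numbers in general.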

In [None]:
# Sahel

sub_str = r'\S+@\S+'

result = re.findall(sub_str, str_email)

print(result)

In [None]:
#Adarsh

sub_str = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'

z = re.findall(sub_str, str_email)
print(z)


In [None]:
#Omkar

sub_str = r'\w*[.]?\w*@\w+[.]\w+[.]?\w+'

result = re.findall(sub_str, str_email)

print(result)

In [None]:
#Nagarajan

sub_str = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'

result = re.findall(sub_str, str_email)
print(result)

In [None]:
#Hitesh

sub_str = r'\w*[.]?\w*@\w*.\w*.?\w*?'

result = re.findall(sub_str, str_email)

print(result)

In [None]:
#Geetha

a= r'\w+[.]?\w+@\w+.com?'
result=re.findall(a,str_email)
print(result)


In [None]:
#Shivam

a= r'(\w+[.]?\w+?@\w+((.com)|(.co.id)))'
result=re.findall(a,str_email)
print(result)


In [None]:
#Vidya

sub_str = re.compile(r'[A-z.0-9]*@\w+[.](com|co[.]id)')   
result = sub_str.finditer(str_email)  
for match in result:
    print(match)


In [None]:
#Ashwini

import re
sub_str = r'\w+[.]?\w+[@]\w+[.]\w+[.]?\w+'
result = re.findall(sub_str, str_email)
print(result)

In [None]:
#Archana

sub_str = r'\w+@\w+[.]\w+[.]\w+|\w+[._]?\w+@\w+[.]\w+'

result = re.findall(sub_str, str_email)
print(result)

In [None]:
#Sowjanya

result = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", str_email)
print(result)

In [None]:
#Apurva

sub_str=r'[^ \n]+[@]\w+[.]\w+[.]?\w*'
result=re.finditer(sub_str,str_email)
for i in result:
    print(i.group())

In [None]:
#Vinti

sub_str=r'[\w\._]+@[\w\._]+\w'

result = re.findall(sub_str, str_email)

print(result)


In [None]:
#Shilpy

sub_str=r'\w+[._]?\w*\@\w+.\w.*?\w+\w.*?'
result=re.finditer(sub_str,str_email)
for x in result:
    print(x.group())


<!-- # What is the re (RegEx) library? It is used for pattern matching in text, so that we can extract patterns such as
email addresses, phone numbers, names, etc.

By default the quantifiers perform a 'greedy match': the engine matches character by character and keeps extending the
match for as long as the conditions in the pattern are still met. It outputs the longest possible match and then resumes
searching from the character after that match.

Why do we put the raw string prefix 'r' in front of a pattern? Regex sequences use \ a lot, and \ is also the escape
character in Python strings. To avoid the conflict, we write every pattern as a raw string.

# Special sequences
\A - Match the pattern only at the beginning of the string.
\b - Match at a word boundary, i.e. at the beginning or end of a word.
\B - Match only where there is NO word boundary, i.e. not at the beginning or end of a word.
\d - Match any digit.
\D - Match any non-digit.
\s - Match any whitespace character (space, tab, newline).
\S - Match any non-whitespace character.
\w - Match any word character (a-z, A-Z, 0-9, _).
\W - Match any non-word character.
\Z - Match the pattern only at the end of the string.


# Methods
findall - finds all matches of the pattern and returns them in a list.
search - finds the first occurrence of the pattern and returns a match object.
match - matches the pattern only at the beginning of the string and returns a match object.
finditer - finds all matches of the pattern in the string and returns an iterator that yields the match objects.
compile - converts the pattern into a re.Pattern object. This makes it easier to reuse a pattern that is needed regularly,
and methods can then be called directly on the pattern object.

# Match object methods
.span - returns the start and end index of the match as a tuple.
.start - returns the start index of the match.
.end - returns the end index of the match (one past the last matched character).
.group - returns the text that the pattern matched.
.string (attribute) - returns the input string on which the pattern matching was performed.

 -->

What is the re library?

re is Python's regular expression (RegEx) module. It helps us search for patterns in text (strings).

###### Methods 
findall
finditer
search
match
compile
sub
subn
split

###### Sequences
\A - Search for the pattern at the beginning of the string. 
\b - Search for the pattern at the beginning (or end) of a word.
\B - The opposite, i.e. the pattern should be present but NOT at the beginning (or end) of a word.
\d - Search for digits.
\D - Search for non-digits (letters, special chars, spaces, newline characters).
\s - Search for whitespace (spaces, tabs, newline characters).
\S - Search for non-whitespace characters (alphanumeric and special chars).
\w - Search for word characters, i.e. a-z, A-Z, 0-9, _
\W - Search for non-word characters, i.e. special characters, spaces, newline chars.
\Z - Search for the pattern at the end of the string. 
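
The sequences listed above can be tried out on a small string. The sketch below uses a made-up sample; start positions are printed for `\b` and `\B` so that the three variants can be told apart.

```python
import re

# Hypothetical sample: 'cat' alone, inside 'concat', and at the start of 'cats_9'
sample = 'cat concat 42 cats_9'

word_start = [m.start() for m in re.finditer(r'\bcat', sample)]  # 'cat' at the start of a word
word_end = [m.start() for m in re.finditer(r'cat\b', sample)]    # 'cat' at the end of a word
inside = [m.start() for m in re.finditer(r'\Bcat', sample)]      # 'cat' NOT at the start of a word
print(word_start, word_end, inside)  # [0, 14] [0, 7] [7]

digits = re.findall(r'\d+', sample)
words = re.findall(r'\w+', sample)
non_words = re.findall(r'\W+', sample)
spaces = re.findall(r'\s', 'a b\tc\nd')   # space, tab and newline are all whitespace
print(digits)     # ['42', '9']
print(words)      # ['cat', 'concat', '42', 'cats_9']
print(non_words)  # [' ', ' ', ' ']
print(spaces)     # [' ', '\t', '\n']

at_beginning = re.findall(r'\Acat', sample)  # only the 'cat' at position 0
at_end = re.findall(r'\d\Z', sample)         # only the digit that closes the string
print(at_beginning, at_end)  # ['cat'] ['9']
```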

In [None]:
import re

In [None]:
input_str = '''Yashoda's phone number is 133456 and email is yashoda.learnbay@gmail.com and the phone number of Vineet is 1311314 along 
with an email ID of vineet@hotmail.com. The Learnbay teams email ID is teams@learnbay.co.in and can be reached at 1130851. Rikkis
phone number is 1458195 and he has not given his email ID'''

In [None]:
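# .+? is lazy and re.S lets '.' cross line breaks, so each name is paired with the first email that follows its number
patt = r'(\b[A-Z][a-z]+\b)[^0-9]+(\d+).+?(\b[a-z.]+@[a-z.]+[a-z]\b)'

result = re.finditer(patt, input_str, flags=re.S)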

In [None]:
print(result)

In [None]:
for x in result:
    print(x.groups())
    print(x.group(1), x.group(2), x.group(3))
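
How `groups()` and `group(n)` relate to the parentheses in a pattern is easier to follow on a small, regular string. The sample and pattern below are made up for illustration:

```python
import re

# Hypothetical records: name, number, email
sample = 'Asha 12345 asha@mail.com; Ravi 67890 ravi@mail.org'

# Three pairs of parentheses -> three groups, numbered from 1 in order of their opening bracket
patt = r'([A-Z][a-z]+)\s(\d+)\s([a-z.]+@[a-z.]+)'

matches = list(re.finditer(patt, sample))
rows = [m.groups() for m in matches]
print(rows)  # [('Asha', '12345', 'asha@mail.com'), ('Ravi', '67890', 'ravi@mail.org')]

first = matches[0]
print(first.group(0))  # whole match: Asha 12345 asha@mail.com
print(first.group(1))  # Asha
print(first.group(3))  # asha@mail.com
```

`group(0)` (or `group()` with no argument) is always the whole match; `groups()` returns only the captured pieces.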

In [4]:
import re

In [5]:
input_str = '''I play basketball. She plays the piano. He plays drums. He also plays cricket'''

sub_str = r'plays?\s(the )?\w+'

result = re.findall(sub_str, input_str)

print(result)

['', 'the ', '', '']


In [6]:
sub_str = r'plays?\s(?:the )?\w+'

result = re.findall(sub_str, input_str)

print(result)

['play basketball', 'plays the piano', 'plays drums', 'plays cricket']
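
The two results above differ because `findall` returns the captured groups instead of the whole match as soon as the pattern contains a capturing group; `(?:...)` groups without capturing. A short sketch on a trimmed sample shows what happens with two capturing groups, and how `finditer` still gives access to the whole match:

```python
import re

# Trimmed, hypothetical sample
sample = 'She plays the piano. He plays drums.'

patt = r'plays?\s(the )?(\w+)'

# Two capturing groups -> findall returns one tuple per match;
# a group that did not take part in the match shows up as ''
captured = re.findall(patt, sample)
print(captured)  # [('the ', 'piano'), ('', 'drums')]

# With finditer, group(0) is the whole match even though the groups capture
whole = [m.group(0) for m in re.finditer(patt, sample)]
print(whole)     # ['plays the piano', 'plays drums']
```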
