In [1]:
import re

In [2]:
t = 's'
t = 'visionnlp'
t = 'i love nlp.'

# search 

- The `re.search()` function takes a regular expression pattern and a string and searches for that pattern within the string. If the search succeeds, `search()` returns a match object; otherwise it returns `None`.

In [3]:
# search 
text = 'I got 90% marks in nlp assignment. I lost 10 because of regex.'
pat = r'\d+%'

match = re.search(pat, text)
match

<re.Match object; span=(6, 9), match='90%'>

In [4]:
match.group(0)

'90%'
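
Besides `group()`, the match object also records where the match was found. The sketch below reuses the text and pattern from the cell above and reads the position back with `span()`; the variable names `start` and `end` are only illustrative.

```python
import re

text = 'I got 90% marks in nlp assignment. I lost 10 because of regex.'
match = re.search(r'\d+%', text)

if match:
    start, end = match.span()   # same numbers as span=(6, 9) in the repr
    print(start, end)           # 6 9
    print(text[start:end])      # 90%
```

Slicing the original string with the span gives back the same text as `match.group()`.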

In [5]:
import re
text = 'an example word:cat!!'   # named text so the built-in str is not shadowed
match = re.search(r'word:\w+', text)
if match:
    print('found', match.group())
else:
    print('did not find')

found word:cat


The `re.search()` call stores its result in a variable named `match`. The if-statement then tests `match`: if it is truthy, the search succeeded and `match.group()` is the matching text (here `'word:cat'`). Otherwise `match` is `None`, which means the search did not succeed and there is no matching text.

The `r` at the start of the pattern string marks a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions, so I recommend always writing pattern strings with the `r` prefix as a habit.
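
To see why the raw-string prefix matters, here is a small sketch comparing the same pattern written with and without it. The sample string `'a cat sat'` and the variable names are made up for illustration.

```python
import re

# In a normal string '\b' is the backspace character (one char);
# in a raw string r'\b' stays as backslash + 'b', which the regex
# engine reads as a word boundary.
lengths = (len('\b'), len(r'\b'))
print(lengths)                          # (1, 2)

plain = re.search('\bcat\b', 'a cat sat')   # looks for backspace characters
raw = re.search(r'\bcat\b', 'a cat sat')    # looks for the whole word 'cat'
print(plain)                            # None
print(raw.group())                      # cat
```

Without the prefix the pattern silently means something else, which is the kind of bug the habit avoids.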

In [6]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig')
match.group()

'iii'

In [7]:
match = re.search(r'igs', 'piiig')
print(match)

None


In [8]:
## . = any char but \n
match = re.search(r'..g', 'piiig')
print(match)

<re.Match object; span=(2, 5), match='iig'>


In [9]:
## \d = digit char, \w = word char
match = re.search(r'\d{2}', 'p123g') # \d\d or \d{2}; use \d+ to match all the digits
match.group()

'12'

In [10]:
match = re.search(r'\w\w\w', '@@abcd!!')
match.group()

'abc'

### Repetition
Things get more interesting when you use `+`, `*` and `?` to specify repetition in the pattern.

`+` -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's

`*` -- 0 or more occurrences of the pattern to its left

`?` -- match 0 or 1 occurrences of the pattern to its left
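
The `?` operator can be sketched with a pattern that accepts two spellings of a word. The pattern `colou?r` and the test strings are hypothetical and chosen only for illustration.

```python
import re

pat = r'colou?r'   # the 'u' may appear zero or one time

american = re.search(pat, 'color')
british = re.search(pat, 'colour')
doubled = re.search(pat, 'colouur')   # two u's is one too many

print(american.group())   # color
print(british.group())    # colour
print(doubled)            # None
```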

In [11]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig')
match

<re.Match object; span=(0, 4), match='piii'>

In [12]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii')
match

<re.Match object; span=(1, 3), match='ii'>

In [13]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')
match

<re.Match object; span=(2, 9), match='1 2   3'>

In [14]:
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')
match

<re.Match object; span=(2, 7), match='12  3'>

In [15]:
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
match

<re.Match object; span=(2, 5), match='123'>

In [16]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')
match

In [17]:
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar')
match

<re.Match object; span=(3, 6), match='bar'>

# Extract emails

In [18]:
import re
text = 'my email is nlp.shwet@gmail.com.'
match = re.search(r'\w+@\w+', text)
if match:
    print(match.group())

shwet@gmail


### Square Brackets
Square brackets can be used to indicate a set of chars, so `[abc]` matches `'a'` or `'b'` or `'c'`. The codes `\w`, `\s` etc. work inside square brackets too with the one exception that dot `(.)` just means a literal dot. 

For the emails problem, the square brackets are an easy way to add `.` and `-` to the set of chars which can appear around the @ with the pattern `r'[\w.-]+@[\w.-]+'` to get the whole email address:

In [19]:
import re
text = 'my email is nlp.shwet@gmail.com'
match = re.search(r'[\w.-]+@[\w.-]+', text)
if match:
    print(match.group())

nlp.shwet@gmail.com
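
The remark that a dot inside square brackets is just a literal dot can be checked directly. This is a minimal sketch; the sample strings are invented for the comparison.

```python
import re

sample = 'abc a.c a-c'

# Outside brackets, '.' matches any character except a newline.
any_char = re.findall(r'a.c', sample)
print(any_char)       # ['abc', 'a.c', 'a-c']

# Inside brackets, '.' is a literal dot.
literal_dot = re.findall(r'a[.]c', sample)
print(literal_dot)    # ['a.c']

# \w keeps its meaning inside brackets, and a '-' placed last is a literal hyphen.
words = re.findall(r'[\w.-]+', 'nlp.shwet, e-mail!')
print(words)          # ['nlp.shwet', 'e-mail']
```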


### Group Extraction
The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses `()` around the username and host in the pattern, like this: `r'([\w.-]+)@([\w.-]+)'`.

In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside the match text. On a successful search, `match.group(1)` is the match text corresponding to the 1st left parenthesis, and `match.group(2)` is the text corresponding to the 2nd left parenthesis. The plain `match.group()` is still the whole match text as usual.

In [20]:
text = 'my email is nlp.shwet@gmail.com, and company mail is hello@visionnlp.com'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)  # raw string, as recommended above
if match:
    print(match.group())   ## 'nlp.shwet@gmail.com' (the whole match)
    print(match.group(1))  ## 'nlp.shwet' (the username, group 1)
    print(match.group(2))  ## 'gmail.com' (the host, group 2)

nlp.shwet@gmail.com
nlp.shwet
gmail.com
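
When every group is needed at once, `match.groups()` returns them together as a tuple, which can be unpacked in one step. The sketch below reuses the string and pattern from the cell above; the names `username` and `host` are illustrative.

```python
import re

text = 'my email is nlp.shwet@gmail.com, and company mail is hello@visionnlp.com'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)

if match:
    # groups() returns (group 1, group 2) for this two-group pattern.
    username, host = match.groups()
    print(username)   # nlp.shwet
    print(host)       # gmail.com
```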


# Findall
`findall()` is probably the single most powerful function in the re module. Above we used `re.search()` to find the first match for a pattern. `findall()` finds all the matches and returns them as a list of strings, with each string representing one match.

In [21]:
text = '''1. I got 90% marks in nlp assignment. I lost 10 because of regex.

another student got 60%'''

pat = r'(?:\d\d)'
match = re.findall(pat, text)
match

['90', '10', '60']

In [22]:
text = ''' 54321 dhfjr 74821 teheoa 39836 Free zip archiver. Creates multi-volume archives. Supports encryption with the AES algorithm. Supports hardware acceleration. High compression. Requires a powerful PC to work with large files.
WinRAR. Cheap archiver. Creates RAR and ZIP archives. Maximum path length is up to 2048 characters. Ability to add text comments to archives. Maximum file size is up to 16 exabytes.
PeaZip. Free file extractor. Multilingual UI. 36376 High compression speed, the years is 2022. 

Supports multi-volume archives. Based on AES 256 encryption. 67635 Doesn’t fully support UTF-8 encoding.
The Unarchiver. Best WinZip alternative. Supports old formats (StuffIt and DiskDoubler) Opens fi'''

In [30]:
pat = r'(?:\d\d\d\d\d)'
re.findall(pat, text)

['54321', '74821', '39836', '36376', '67635']

In [25]:
## Suppose we have a text with many email addresses
text = 'my email is nlp.shwet@gmail.com, and company mail is hello@visionnlp.com'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text) ## ['nlp.shwet@gmail.com', 'hello@visionnlp.com']
emails

['nlp.shwet@gmail.com', 'hello@visionnlp.com']

In [26]:
for email in emails:
    # do something with each found email string
    print(email)

nlp.shwet@gmail.com
hello@visionnlp.com


In [27]:
text = 'my email is nlp.shwet@gmail.com, and company mail is hello@visionnlp.com'

## With two groups in the pattern, findall() returns a list of (username, host) tuples
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
#print(tuples)

for username, host in tuples:
    print(username)
    print(host)

nlp.shwet
gmail.com
hello
visionnlp.com
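
What `findall()` returns depends on how many groups the pattern contains. The sketch below runs three variants of the email pattern over the same string to show the difference; the variable names are illustrative.

```python
import re

text = 'my email is nlp.shwet@gmail.com, and company mail is hello@visionnlp.com'

# No groups: a list of whole matches.
whole = re.findall(r'[\w.-]+@[\w.-]+', text)
print(whole)   # ['nlp.shwet@gmail.com', 'hello@visionnlp.com']

# One group: a list containing only that group's text.
hosts = re.findall(r'[\w.-]+@([\w.-]+)', text)
print(hosts)   # ['gmail.com', 'visionnlp.com']

# Two or more groups: a list of tuples, one tuple per match.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)   # [('nlp.shwet', 'gmail.com'), ('hello', 'visionnlp.com')]
```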


# sub

In [28]:
text = 'my email is nlp.shwet@gmail.com, and company mail is hello@visionnlp.com,  xyz@abc.org'
## re.sub(pat, replacement, text) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement

print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@google.com', text))

my email is nlp.shwet@google.com, and company mail is hello@google.com,  xyz@google.com


In [29]:
## The plain string method replace() only swaps a fixed substring, so the other hosts are untouched
text.replace('gmail.com', 'google.com')

'my email is nlp.shwet@google.com, and company mail is hello@visionnlp.com,  xyz@abc.org'
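
The `re.sub()` cell uses only `\1` in the replacement. As a minimal sketch, the same pattern can keep the host through `\2` instead, and the optional `count` argument limits how many matches are replaced. The `***` mask and the variable names are made up for illustration.

```python
import re

text = 'my email is nlp.shwet@gmail.com, and company mail is hello@visionnlp.com'
pat = r'([\w.-]+)@([\w.-]+)'

# Keep the host (\2) and hide the username.
masked = re.sub(pat, r'***@\2', text)
print(masked)
# my email is ***@gmail.com, and company mail is ***@visionnlp.com

# count=1 limits the substitution to the first match only.
first_only = re.sub(pat, r'\1@google.com', text, count=1)
print(first_only)
# my email is nlp.shwet@google.com, and company mail is hello@visionnlp.com
```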