# NLP - 01 : Regular Expressions

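The custom function `printDetails(m)` below prints the searched string with the matched span highlighted in red, or `Not found` when there is no match. In the captured outputs in this notebook, the markers `[91m` and `[39m` come from the ANSI colour codes and enclose the matched text.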

In [3]:
import re

#ANSI colors: \033[91m (red), \033[92m (green), \033[93m (yellow), \033[94m (blue), \033[95m (pink), \033[39m (reset to default)
#m.span() returns (start, end); the end position is not inclusive
def printDetails(m):
    if m:
        #print(f'Found "{m.group()}" at position {m.span()} in the string: {m.string}')
        start, end = m.span()
        #Print the whole string with the matched span in red
        print(f'{m.string[:start]}\033[91m{m.string[start:end]}\033[39m{m.string[end:]}')
    else:
        print('Not found')

txt= "Om Sri Sairam"
x= re.search("sai",txt)
printDetails(x)
if x:
    print(x.span())
    print(x.string)
    print(x.group())

Not found
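
The search above fails because matching is case-sensitive by default. As a small sketch (the variable names are only illustrative), the standard `re.IGNORECASE` flag lets the same pattern succeed, and the resulting match object can be inspected through `span()` and `group()`:

```python
import re

txt = "Om Sri Sairam"

# Case-sensitive search: lowercase "sai" does not occur, so there is no match.
strict = re.search("sai", txt)

# With re.IGNORECASE, "sai" also matches "Sai".
relaxed = re.search("sai", txt, flags=re.IGNORECASE)

start, end = relaxed.span()   # the end position is exclusive
print(strict)
print(relaxed.group(), start, end)
print(txt[start:end] == relaxed.group())
```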


## Disjunction : `[]`

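Square brackets define a character class: a disjunction over single characters. The class matches exactly one character, which may be any of those listed inside the brackets.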

1. Match either `s` or `S` at that position.

In [4]:
x= re.search("[sS]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram


2. A range: match any capital letter at that position.

In [5]:
x= re.search("[A-Z]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram


3. Match any digit at that position.

In [6]:
x= re.search("[0-9]ai",txt)
printDetails(x)

Not found


4. Match any lowercase letter at that position. There is no match here because the only `ai` in the string is preceded by a capital `S`.

In [7]:
x= re.search("[a-z]ai",txt)
printDetails(x)

Not found


5. Match any capital letter, or a lowercase letter from `a` to `c`, at that position.

In [8]:
x= re.search("[A-Za-c]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram


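6. `[A-z]` covers every letter, but the range also includes the six characters that sit between `Z` and `a` in ASCII (`[`, `\`, `]`, `^`, `_` and the backtick); `[A-Za-z]` matches letters only.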

In [9]:
x= re.search("[A-z]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram
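
One caveat with `[A-z]`: a range follows character codes, and six non-letter characters lie between `Z` and `a`. The sketch below lists them and contrasts `[A-z]` with the stricter `[A-Za-z]` on a made-up string containing an underscore.

```python
import re

# Characters whose codes fall strictly between 'Z' and 'a'.
between = [chr(c) for c in range(ord("Z") + 1, ord("a"))]
print(between)

sample = "Om Sri _airam"   # hypothetical input with '_' before "ai"

# "[A-z]" accepts the underscore as well as letters.
loose = re.search("[A-z]ai", sample)

# "[A-Za-z]" accepts letters only, so nothing matches here.
strict = re.search("[A-Za-z]ai", sample)
print(loose, strict)
```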


7. An invalid range: `a` (code 97) comes after `Z` (code 90), so `[a-Z]` runs backwards and the pattern does not compile.

In [10]:
x= re.search("[a-Z]ai",txt)
printDetails(x)

error: bad character range a-Z at position 1
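
A range must run from a lower character code to a higher one. As a minimal sketch, an invalid pattern like this can be caught with `re.error`, the exception the `re` module raises when a pattern fails to compile:

```python
import re

print(ord("a"), ord("Z"))   # the code of 'a' is larger than the code of 'Z'

try:
    re.compile("[a-Z]ai")
    compiled = True
except re.error as exc:
    compiled = False
    message = str(exc)
    print("Invalid pattern:", message)
```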

8. Match `s`, `^` or `S` at that position: a `^` that is not the first character inside `[]` is a literal caret.

In [14]:
txt= 'Om Sri Sairam'
x= re.search("[s^S]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram


9. When `^` is the first character inside `[]`, it negates the set: `[^s]` matches any character except `s`.

In [15]:
x= re.search("[^s]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram


10. Any character that is not a digit.

In [16]:
x= re.search("[^0-9]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram


11. Any character outside the lowercase range `s` to `z`.

In [17]:
x= re.search("[^s-z]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram


12. Inside `[]`, "`.`" is a literal period rather than "any character", so `[^.]` matches any character except a period.

In [19]:
x= re.search("[^.]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram
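
Inside square brackets most metacharacters lose their special meaning, so `[.]` is a literal period and `[^.]` is any character other than a period. A short sketch on a hypothetical three-character string:

```python
import re

sample = "a.b"

any_char = re.findall(".", sample)        # outside []: any character
literal_dot = re.findall("[.]", sample)   # inside []: a literal period
not_dot = re.findall("[^.]", sample)      # anything except a period

print(any_char, literal_dot, not_dot)
```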


13. '`^`' as a literal character: when it is not the first character inside `[]`, `[s^]` matches `s` or `^`.

In [20]:
#txt= 'Om Sri rai ram'
x= re.search("[s^]ai",txt)
printDetails(x)

Not found


In [21]:
x= re.search("[S^]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram


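14. `[S^a]` matches `S`, `^` or `a` at that position; the `^` is literal here because it is not the first character inside the brackets.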

In [22]:
x= re.search("[S^a]ai",txt)
printDetails(x)

Om Sri [91mSai[39mram


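15. Outside `[]`, '`^`' anchors the match to the start of the string: `^Sai` matches `Sai` only at the very beginning, so there is no match here.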

In [23]:
#txt= 'Om Sri Sai Ram.Sai'
x= re.search("^Sai",txt)
printDetails(x)

Not found
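
The caret plays three different roles depending on where it appears: negation as the first character inside `[]`, a literal caret elsewhere inside `[]`, and a start-of-string anchor outside `[]`. A sketch comparing them on a made-up string:

```python
import re

sample = "Sai rai ^ai"   # hypothetical input

negated = re.search("[^S]ai", sample)    # a character other than S, then "ai"
literal = re.findall("[S^]ai", sample)   # S or a literal ^, then "ai"
anchored = re.search("^Sai", sample)     # "Sai" at the very start of the string
missing = re.search("^rai", sample)      # "rai" is not at the start

print(negated.group(), literal, anchored.group(), missing)
```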


16. '`?`' makes the element immediately before it optional (zero or one occurrence).

In `Sai?`, `Sa` is required but `i` is optional.

In [24]:
#txt= 'Om Sri S'
x= re.search("Sai?",txt)
printDetails(x)

Om Sri [91mSai[39mram


In [25]:
x= re.search("Sa?",txt)
printDetails(x)

Om [91mS[39mri Sairam


Here, `S` is optional but `i` is required. The search returns the leftmost match, and the first `i` in the string (in `Sri`) is not preceded by `S`, so only that `i` appears in red.

In [26]:
#txt= 'Om Sr Si Ram'
print(txt)
x= re.search("S?i",txt)
printDetails(x)

Om Sri Sairam
Om Sr[91mi[39m Sairam
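
The engine reports the leftmost position at which the pattern can match, and `?` takes the optional `S` only when an `S` is immediately followed by `i` at that position. The sketch below uses the string from the commented-out line in the cell above as a second input.

```python
import re

first = re.search("S?i", "Om Sri Sairam")   # leftmost 'i' has no 'S' before it
second = re.search("S?i", "Om Sr Si Ram")   # leftmost 'i' is preceded by 'S'

print(first.group(), first.span())
print(second.group(), second.span())
```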


17. An invalid expression: the `?` at position 0 has nothing before it to make optional.

In [27]:
x= re.search("?",txt)
printDetails(x)

error: nothing to repeat at position 0

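Here the backslash escapes the `?`, so the pattern looks for a literal question mark. There is no match because the text contains none. A raw string (`r"..."`) keeps the backslash intact for the regex engine.

In [28]:
x= re.search(r"\?",txt)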
printDetails(x)

Not found


18. Matching a literal string: a pattern with no special characters matches exactly that sequence of characters.

In [29]:
txt= "baba baabaa baaaabaaa"
x= re.search("baba",txt)
printDetails(x)

[91mbaba[39m baabaa baaaabaaa


19. '`*`' (the Kleene star) matches zero or more occurrences of the element immediately before it.

In [30]:
txt= "bb"
x= re.search("ba*ba*",txt)
printDetails(x)

[91mbb[39m


In [31]:
txt= "baabaa baaaabaaa"
x= re.search("baa*baa*",txt)
printDetails(x)

[91mbaabaa[39m baaaabaaa


20. The function `re.findall()` takes the pattern and the string to search, and returns a list of all non-overlapping matches.

In [32]:
txt= "bba baabaa baaaabaaa"
x= re.findall("ba*ba*",txt)
print(x)

['bba', 'baabaa', 'baaaabaaa']


In [33]:
txt= "bba baabaa baaaabaaa"
x= re.findall("baa*baa*",txt)
print(x)

['baabaa', 'baaaabaaa']


In [34]:
txt= "baba baabaa baaaabaaa"
x= re.findall("baaa*baaa*",txt)
print(x)

['baabaa', 'baaaabaaa']


In [35]:
txt= "ggbadef"
x= re.search("b*a*",txt)
print(x)
x= re.findall("b*a*",txt)
print(x)

<re.Match object; span=(0, 0), match=''>
['', '', 'ba', '', '', '', '']
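
Because both `b*` and `a*` may match zero characters, the pattern succeeds with an empty match at every position where no `b` or `a` begins. Two possible ways to keep only the useful results, as a sketch: drop the empty strings afterwards, or rewrite the pattern so that it needs at least one character (`b+a*|a+` is one such rewrite, chosen here for illustration).

```python
import re

txt = "ggbadef"

all_matches = re.findall("b*a*", txt)

# Option 1: filter out the empty matches.
non_empty = [s for s in all_matches if s]

# Option 2: require at least one 'b' or one 'a'.
at_least_one = re.findall("b+a*|a+", txt)

print(all_matches)
print(non_empty, at_least_one)
```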


In [36]:
txt= "baba baabaa baaaabaaa"
x= re.search("b*a*",txt)
printDetails(x)
x= re.findall("b*a*",txt)
print(x)

[91mba[39mba baabaa baaaabaaa
['ba', 'ba', '', 'baa', 'baa', '', 'baaaa', 'baaa', '']


In [37]:
txt= "babaaaaaabbbb cccc"
x= re.search("[ba]*",txt)
printDetails(x)
x= re.findall("[ba]*",txt)
print(x)

[91mbabaaaaaabbbb[39m cccc
['babaaaaaabbbb', '', '', '', '', '', '']


In [38]:
txt= "baba aaaaa bbbb cccc"
x= re.search("[ba][ba]*",txt)
printDetails(x)
x= re.findall("[ba][ba]*",txt)
print(x)

[91mbaba[39m aaaaa bbbb cccc
['baba', 'aaaaa', 'bbbb']


In [39]:
txt= "baba aaaaa bbbb cccc"
x= re.search("[ba]+",txt)
printDetails(x)
x= re.findall("[ba]+",txt)
print(x)

[91mbaba[39m aaaaa bbbb cccc
['baba', 'aaaaa', 'bbbb']


In [40]:
txt= "bba bca bsc msc a\na"
print(txt)
x= re.search("b.*a",txt)
printDetails(x)
x= re.findall("b.*a",txt)
print(x)

bba bca bsc msc a
a
[91mbba bca bsc msc a[39m
a
['bba bca bsc msc a']
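
`.*` is greedy, so it runs to the last `a` on the line, and `.` does not match a newline by default. Two standard variations, sketched on the same text: the lazy form `.*?` stops at the first possible `a`, and the `re.DOTALL` flag lets `.` cross line breaks.

```python
import re

txt = "bba bca bsc msc a\na"

greedy = re.findall("b.*a", txt)                    # longest match on the line
lazy = re.findall("b.*?a", txt)                     # shortest match each time
dotall = re.findall("b.*a", txt, flags=re.DOTALL)   # '.' also matches '\n'

print(greedy)
print(lazy)
print(dotall)
```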


21. '`+`' (the Kleene plus) is similar to '`*`', but requires one or more occurrences of the preceding element instead of zero or more.

In [41]:
#Find the statement with two cats
txt= "first line \n dog cats dog cats dog second line \n third line cats \n fourth line cats dog catssssd"
x= re.search("cats+",txt)
printDetails(x)
x= re.findall("cats+",txt)
print(x)

first line 
 dog [91mcats[39m dog cats dog second line 
 third line cats 
 fourth line cats dog catssssd
['cats', 'cats', 'cats', 'cats', 'catssss']


In [42]:
#Find the statement with two cats
txt= "first line \n dog catscats dog dog second line \n third line cats \n fourth line cats dog catssssd"
#txt= "first line \n dog catscats dog dog second line \n third line cats cats fourth line cats dog catssssd"
x= re.search("cats.+cats",txt)
printDetails(x)
x= re.findall("cats.+cats",txt)
print(x)

first line 
 dog catscats dog dog second line 
 third line cats 
 fourth line [91mcats dog cats[39msssd
['cats dog cats']


In [43]:
#Find the statement with two cats
txt= "first line \n dog catscats dog dog second line \n third line cats \n fourth line cats dog catssssd"
#txt= "first line \n dog catscats dog dog second line \n third line cats cats fourth line cats dog catssssd"
x= re.search("cats.*cats",txt)
printDetails(x)
x= re.findall("cats.*cats",txt)
print(x)

first line 
 dog [91mcatscats[39m dog dog second line 
 third line cats 
 fourth line cats dog catssssd
['catscats', 'cats dog cats']


In [44]:
txt= " The first line.\n The second line.\nThe The third line.\nThe fourth line. "
x= re.search("^The",txt, flags=re.MULTILINE)
printDetails(x)
x= re.findall("^The",txt, flags=re.MULTILINE)
print(x)

 The first line.
 The second line.
[91mThe[39m The third line.
The fourth line. 
['The', 'The']
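
Without `re.MULTILINE`, `^` matches only at the very beginning of the whole string. The sketch reuses the text above, which begins with a space, to show the difference:

```python
import re

txt = " The first line.\n The second line.\nThe The third line.\nThe fourth line. "

whole_string = re.findall("^The", txt)                  # '^' = start of string
per_line = re.findall("^The", txt, flags=re.MULTILINE)  # '^' = start of each line

print(whole_string, per_line)
```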


22. '`$`' anchors the match to the end of the string or, with `re.MULTILINE`, to the end of each line: the element before `$` must be the last thing on the line.

In [45]:
txt= " The first line\n The second line.\nThe third line\nThe fourth line"
x= re.search("first line$",txt, flags=re.MULTILINE)
printDetails(x)
x= re.findall("first line$",txt, flags=re.MULTILINE)
print(x)

 The [91mfirst line[39m
 The second line.
The third line
The fourth line
['first line']


In [46]:
txt= " The first line.\n The second line.\nThe third line.\nThe fourth line. "
x= re.search(".$",txt, flags=re.MULTILINE)
printDetails(x)
x= re.findall(".$",txt, flags=re.MULTILINE)
print(x)

 The first line[91m.[39m
 The second line.
The third line.
The fourth line. 
['.', '.', '.', ' ']


In [47]:
txt= " The first line.\n The second line.\nThe third line.\nThe fourth line. "
x= re.search(r"\.$",txt, flags=re.MULTILINE)
printDetails(x)
x= re.findall(r"\.$",txt, flags=re.MULTILINE)
print(x)

 The first line[91m.[39m
 The second line.
The third line.
The fourth line. 
['.', '.', '.']


In [49]:
txt= "there is the. other in theirs"
x= re.search("the",txt)
printDetails(x)
x= re.findall("the",txt)
print(x)

[91mthe[39mre is the. other in theirs
['the', 'the', 'the', 'the']


In [50]:
txt= "there is the other in theirs"
print(txt)
x= re.search(r"\bthe\b",txt)
printDetails(x)
x= re.findall("\\bthe\\b",txt)
print(x)

there is the other in theirs
there is [91mthe[39m other in theirs
['the']


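23. Word boundary using '`\b`': it matches the position between a word character (letter, digit or underscore) and a non-word character, so `\b99\b` matches the `99` in `$99` but not the one in `199` or `299`.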

In [51]:
txt= "his reg no is 99. his rank is 199. he bought a shirt for Rs. 299 which in dollars is $99"
x= re.search("\\b99\\b",txt)
printDetails(x)
x= re.findall("\\b99\\b",txt)
print(x)

his reg no is [91m99[39m. his rank is 199. he bought a shirt for Rs. 299 which in dollars is $99
['99', '99']


In [52]:
txt= "his reg no is 99. his rank is 199. he bought a shirt for Rs. 299 which in dollars is $99"
x= re.finditer("\\b99\\b",txt)
for m in x:
    printDetails(m)

his reg no is [91m99[39m. his rank is 199. he bought a shirt for Rs. 299 which in dollars is $99
his reg no is 99. his rank is 199. he bought a shirt for Rs. 299 which in dollars is $[91m99[39m


In [53]:
#Find all cats or dogs. Why doesn't this work?
txt= "This line has none. \nThis line has a cat.\nThis line has a dog.\nThis line has a cat and a dog."
x= re.finditer("[catdog]",txt)
for m in x:
    printDetails(m)

This line h[91ma[39ms none. 
This line has a cat.
This line has a dog.
This line has a cat and a dog.
This line has n[91mo[39mne. 
This line has a cat.
This line has a dog.
This line has a cat and a dog.
This line has none. 
This line h[91ma[39ms a cat.
This line has a dog.
This line has a cat and a dog.
This line has none. 
This line has [91ma[39m cat.
This line has a dog.
This line has a cat and a dog.
This line has none. 
This line has a [91mc[39mat.
This line has a dog.
This line has a cat and a dog.
This line has none. 
This line has a c[91ma[39mt.
This line has a dog.
This line has a cat and a dog.
This line has none. 
This line has a ca[91mt[39m.
This line has a dog.
This line has a cat and a dog.
This line has none. 
This line has a cat.
This line h[91ma[39ms a dog.
This line has a cat and a dog.
This line has none. 
This line has a cat.
This line has [91ma[39m dog.
This line has a cat and a dog.
This line has none. 
This line has a cat.
This line has a [91md

24. '`|`' is the disjunction (alternation) operator between whole expressions: `cat|dog` matches either `cat` or `dog`.

In [54]:
#Find all cats or dogs.
txt= "This line has none. \nThis line has a cat.\nThis line has a dog.\nThis line has a cat and a dog."
x= re.finditer("cat|dog",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has none. 
This line has a [91mcat[39m.
This line has a dog.
This line has a cat and a dog.
1
This line has none. 
This line has a cat.
This line has a [91mdog[39m.
This line has a cat and a dog.
2
This line has none. 
This line has a cat.
This line has a dog.
This line has a [91mcat[39m and a dog.
3
This line has none. 
This line has a cat.
This line has a dog.
This line has a cat and a [91mdog[39m.
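
`[catdog]` is a character class: it matches one character drawn from the set, not the words `cat` or `dog`. The sketch contrasts it with alternation on a shortened sample string:

```python
import re

sample = "a cat and a dog"   # shortened, hypothetical input

single_chars = re.findall("[catdog]", sample)   # one character per match
words = re.findall("cat|dog", sample)           # whole alternatives

print(single_chars)
print(words)
```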


In [55]:
#Find all instances of funny or funnier.
txt= "This line is funny. \nThis line is more funny.\nThis line is funnier.\nThis line has no fun but carrier."
x= re.finditer("funny|ier",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line is [91mfunny[39m. 
This line is more funny.
This line is funnier.
This line has no fun but carrier.
1
This line is funny. 
This line is more [91mfunny[39m.
This line is funnier.
This line has no fun but carrier.
2
This line is funny. 
This line is more funny.
This line is funn[91mier[39m.
This line has no fun but carrier.
3
This line is funny. 
This line is more funny.
This line is funnier.
This line has no fun but carr[91mier[39m.


In [56]:
#Find all instances of funny or funnier.
txt= "This line is funny. \nThis line is more funny.\nThis line is funnier.\nThis line has no fun but carrier."
x= re.finditer("funn(y|ier)",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line is [91mfunny[39m. 
This line is more funny.
This line is funnier.
This line has no fun but carrier.
1
This line is funny. 
This line is more [91mfunny[39m.
This line is funnier.
This line has no fun but carrier.
2
This line is funny. 
This line is more funny.
This line is [91mfunnier[39m.
This line has no fun but carrier.
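
Parentheses limit how far `|` reaches: `funny|ier` means `funny` or `ier`, while `funn(y|ier)` means `funn` followed by `y` or `ier`. One thing to watch for is that `re.findall()` returns the contents of a capturing group rather than the whole match; a non-capturing group `(?:...)` avoids that. A sketch on the same text:

```python
import re

txt = "This line is funny. \nThis line is more funny.\nThis line is funnier.\nThis line has no fun but carrier."

ungrouped = re.findall("funny|ier", txt)      # 'funny' or 'ier'
captured = re.findall("funn(y|ier)", txt)     # returns only the group contents
whole = re.findall("funn(?:y|ier)", txt)      # returns the full matches

print(ungrouped)
print(captured)
print(whole)
```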


In [57]:
#Find all instances of Col<int> repeated any number of times.
txt= "This line has Col13. \nThis line has Col13 and Col14.\nOnly Col14.\nCol13 Col14"
x= re.finditer("Col[0-9][0-9]*",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has [91mCol13[39m. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
1
This line has Col13. 
This line has [91mCol13[39m and Col14.
Only Col14.
Col13 Col14
2
This line has Col13. 
This line has Col13 and [91mCol14[39m.
Only Col14.
Col13 Col14
3
This line has Col13. 
This line has Col13 and Col14.
Only [91mCol14[39m.
Col13 Col14
4
This line has Col13. 
This line has Col13 and Col14.
Only Col14.
[91mCol13[39m Col14
5
This line has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 [91mCol14[39m


In [58]:
#Find all instances of Col<int> repeated any number of times.
txt= "This line has Col13. \nThis line has Col13 and Col14.\nOnly Col14.\nCol13 Col14"
x= re.finditer("(Col[0-9][0-9])*",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
[91m[39mThis line has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
1
T[91m[39mhis line has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
2
Th[91m[39mis line has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
3
Thi[91m[39ms line has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
4
This[91m[39m line has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
5
This [91m[39mline has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
6
This l[91m[39mine has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
7
This li[91m[39mne has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
8
This lin[91m[39me has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
9
This line[91m[39m has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
10
This line [91m[39mhas Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
11
This line h[91m

In [59]:
#Find all instances of Col<int> repeated any number of times.
txt= "This line has Col13. \nThis line has Col13 and Col14.\nOnly Col14.\nCol13 Col14"
x= re.finditer("(Col[0-9]+ *)+",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has [91mCol13[39m. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
1
This line has Col13. 
This line has [91mCol13 [39mand Col14.
Only Col14.
Col13 Col14
2
This line has Col13. 
This line has Col13 and [91mCol14[39m.
Only Col14.
Col13 Col14
3
This line has Col13. 
This line has Col13 and Col14.
Only [91mCol14[39m.
Col13 Col14
4
This line has Col13. 
This line has Col13 and Col14.
Only Col14.
[91mCol13 Col14[39m


In [60]:
#Find all instances of Col<int> repeated any number of times.
txt= "This line has Col13. \nThis line has Col13 and Col14.\nOnly Col14.\nCol13 Col14"
x= re.finditer("^(Col[0-9]+ *)+",txt,flags=re.MULTILINE)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has Col13. 
This line has Col13 and Col14.
Only Col14.
[91mCol13 Col14[39m


In [61]:
#Find all instances of Col<int> repeated any number of times.
txt= "This line has Col13. \nThis line has Col13 and Col14.\nOnly Col14.\nCol13 Col14  Col15       Col16"
x= re.finditer("^(Col[0-9]+ *)+",txt,flags=re.MULTILINE)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has Col13. 
This line has Col13 and Col14.
Only Col14.
[91mCol13 Col14  Col15       Col16[39m


In [62]:
#Find all instances of Col<int><int>.
txt= "This line has Col13. \nThis line has Col13 and Col14.\nOnly Col14.\nCol13 Col14"
x= re.finditer("Col[0-9]{2}",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has [91mCol13[39m. 
This line has Col13 and Col14.
Only Col14.
Col13 Col14
1
This line has Col13. 
This line has [91mCol13[39m and Col14.
Only Col14.
Col13 Col14
2
This line has Col13. 
This line has Col13 and [91mCol14[39m.
Only Col14.
Col13 Col14
3
This line has Col13. 
This line has Col13 and Col14.
Only [91mCol14[39m.
Col13 Col14
4
This line has Col13. 
This line has Col13 and Col14.
Only Col14.
[91mCol13[39m Col14
5
This line has Col13. 
This line has Col13 and Col14.
Only Col14.
Col13 [91mCol14[39m


In [63]:
#Find all instances of Col<int><int>.
txt= "This line has Col1. \nThis line has Col3 and Col14.\nOnly Col14.\nCol13 Col14"
x= re.finditer("Col[0-9]{2}",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has Col1. 
This line has Col3 and [91mCol14[39m.
Only Col14.
Col13 Col14
1
This line has Col1. 
This line has Col3 and Col14.
Only [91mCol14[39m.
Col13 Col14
2
This line has Col1. 
This line has Col3 and Col14.
Only Col14.
[91mCol13[39m Col14
3
This line has Col1. 
This line has Col3 and Col14.
Only Col14.
Col13 [91mCol14[39m


In [64]:
#Find all instances of Col followed by one or two digits.
txt= "This line has Col123. \nThis line has Col3 and Col14.\nOnly Col14.\nCol13 Col14"
x= re.finditer("Col[0-9]{1,2}",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has [91mCol12[39m3. 
This line has Col3 and Col14.
Only Col14.
Col13 Col14
1
This line has Col123. 
This line has [91mCol3[39m and Col14.
Only Col14.
Col13 Col14
2
This line has Col123. 
This line has Col3 and [91mCol14[39m.
Only Col14.
Col13 Col14
3
This line has Col123. 
This line has Col3 and Col14.
Only [91mCol14[39m.
Col13 Col14
4
This line has Col123. 
This line has Col3 and Col14.
Only Col14.
[91mCol13[39m Col14
5
This line has Col123. 
This line has Col3 and Col14.
Only Col14.
Col13 [91mCol14[39m


In [65]:
#Find all instances of Col followed by two or more digits.
txt= "This line has Col123. \nThis line has Col3 and Col14.\nOnly Col14.\nCol13 Col14"
x= re.finditer("Col[0-9]{2,}",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has [91mCol123[39m. 
This line has Col3 and Col14.
Only Col14.
Col13 Col14
1
This line has Col123. 
This line has Col3 and [91mCol14[39m.
Only Col14.
Col13 Col14
2
This line has Col123. 
This line has Col3 and Col14.
Only [91mCol14[39m.
Col13 Col14
3
This line has Col123. 
This line has Col3 and Col14.
Only Col14.
[91mCol13[39m Col14
4
This line has Col123. 
This line has Col3 and Col14.
Only Col14.
Col13 [91mCol14[39m
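
The counter forms used above can be summarised as `{n}` (exactly n), `{n,m}` (from n to m) and `{n,}` (n or more). A sketch on a hypothetical string; the trailing `\b` in the last pattern is an addition, not part of the cells above, that rejects longer digit runs:

```python
import re

sample = "Col1 Col12 Col123"   # hypothetical input

exactly_two = re.findall(r"Col[0-9]{2}", sample)
one_or_two = re.findall(r"Col[0-9]{1,2}", sample)
two_or_more = re.findall(r"Col[0-9]{2,}", sample)
exactly_two_whole = re.findall(r"Col[0-9]{2}\b", sample)

print(exactly_two)
print(one_or_two)
print(two_or_more)
print(exactly_two_whole)
```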


In [68]:
#An empty {} is not a counter: it is matched as the literal characters {}.
txt= "This line has Col1{}. \nThis line has Col3 and Col14.\nOnly Col14.\nCol13 Col14 Col"
x= re.search("Col[0-9]{}",txt)
print(x)
printDetails(x)

<re.Match object; span=(14, 20), match='Col1{}'>
This line has [91mCol1{}[39m. 
This line has Col3 and Col14.
Only Col14.
Col13 Col14 Col


In [69]:
#Find all instances of *?.
txt= "This line has *?. \nThis line also has *?."
x= re.finditer("*?",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

error: nothing to repeat at position 0

In [70]:
#Find all instances of *?.
txt= "This line has *?. \nThis line also has *?."
x= re.finditer(r"\*\?",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
This line has [91m*?[39m. 
This line also has *?.
1
This line has *?. 
This line also has [91m*?[39m.
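
Instead of escaping each metacharacter by hand, the standard `re.escape()` function can build the literal pattern. A minimal sketch on the same text:

```python
import re

txt = "This line has *?. \nThis line also has *?."

# re.escape() puts a backslash in front of each special character.
pattern = re.escape("*?")
found = re.findall(pattern, txt)

print(pattern)
print(found)
```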


Precedence hierarchy, from highest to lowest: <br>1. Parentheses <br>2. Counters <br>3. Sequences and anchors <br>4. Disjunction<br>
<i>the*</i> matches <i>th</i> followed by zero or more occurrences of <i>e</i>, as in <i>theeee</i>; the star does not apply to the whole of <i>the</i><br>
<i>(the)*</i> matches zero or more occurrences of <i>the</i>, as in <i>thethethe</i><br>
<i>the|any</i> matches <i>the</i> or <i>any</i> and not <i>theny</i> or <i>thany</i>
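
These precedence rules can be checked with `re.fullmatch()`, which succeeds only if the entire string matches the pattern. A small sketch:

```python
import re

# The star binds to the single 'e', not to the whole of 'the'.
star_on_e = re.fullmatch("the*", "theeee")
star_on_e_fails = re.fullmatch("the*", "thethe")

# Parentheses make the star apply to the group 'the'.
star_on_group = re.fullmatch("(the)*", "thethe")

# '|' has the lowest precedence: whole sequences are the alternatives.
either = [bool(re.fullmatch("the|any", w)) for w in ("the", "any", "theny", "thany")]

print(bool(star_on_e), bool(star_on_e_fails), bool(star_on_group), either)
```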

In [71]:
#Find all instances of the English article "the".
txt= "The Cheetah is faster than the other animals. \nThe article in English is the."
x= re.finditer("the",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
The Cheetah is faster than [91mthe[39m other animals. 
The article in English is the.
1
The Cheetah is faster than the o[91mthe[39mr animals. 
The article in English is the.
2
The Cheetah is faster than the other animals. 
The article in English is [91mthe[39m.


In [72]:
#Find all instances of the English article "the".
txt= "The Cheetah is faster than the other animals. \nThe article in English is the."
x= re.finditer("[tT]he",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
[91mThe[39m Cheetah is faster than the other animals. 
The article in English is the.
1
The Cheetah is faster than [91mthe[39m other animals. 
The article in English is the.
2
The Cheetah is faster than the o[91mthe[39mr animals. 
The article in English is the.
3
The Cheetah is faster than the other animals. 
[91mThe[39m article in English is the.
4
The Cheetah is faster than the other animals. 
The article in English is [91mthe[39m.


In [73]:
#Find all instances of the English article "the".
#No output here: in a non-raw string, "\b" is a backspace character, not a word boundary.
txt= "The Cheetah is faster than the other animals. \nThe article in English is the."
x= re.finditer("\b[tT]he\b",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)
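
The cell above finds nothing because, in an ordinary Python string, `\b` is the backspace character rather than the two characters backslash and `b`. Writing the pattern as a raw string (`r"..."`) or doubling the backslash passes `\b` through to the regex engine. A sketch of the difference:

```python
import re

txt = "The Cheetah is faster than the other animals. \nThe article in English is the."

# "\b" is one character (backspace); r"\b" is two (backslash, 'b').
print(len("\b"), len(r"\b"), "\\b" == r"\b")

plain = re.findall("\b[tT]he\b", txt)   # looks for backspace characters
raw = re.findall(r"\b[tT]he\b", txt)    # looks for word boundaries

print(plain)
print(raw)
```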

In [74]:
#Find all instances of the English article "the".
txt= "The Cheetah is faster than the other animals. \nThe article in English is the."
x= re.finditer("\\b[tT]he\\b",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
[91mThe[39m Cheetah is faster than the other animals. 
The article in English is the.
1
The Cheetah is faster than [91mthe[39m other animals. 
The article in English is the.
2
The Cheetah is faster than the other animals. 
[91mThe[39m article in English is the.
3
The Cheetah is faster than the other animals. 
The article in English is [91mthe[39m.


In [75]:
#Find all instances of the English article "the" without \b
txt= "The Cheetah is faster than the other animals. \nThe article in English is the. \n What is there?"
x= re.finditer("[^a-zA-Z][tT]he",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
The Cheetah is faster than[91m the[39m other animals. 
The article in English is the. 
 What is there?
1
The Cheetah is faster than the other animals. [91m
The[39m article in English is the. 
 What is there?
2
The Cheetah is faster than the other animals. 
The article in English is[91m the[39m. 
 What is there?
3
The Cheetah is faster than the other animals. 
The article in English is the. 
 What is[91m the[39mre?


In [76]:
#Find all instances of the English article "the" without \b
txt= "The Cheetah is faster than the other animals. \nThe article in English is the. \n What is there?"
x= re.finditer("[^a-zA-Z][tT]he[^a-zA-Z]",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
The Cheetah is faster than[91m the [39mother animals. 
The article in English is the. 
 What is there?
1
The Cheetah is faster than the other animals. [91m
The [39marticle in English is the. 
 What is there?
2
The Cheetah is faster than the other animals. 
The article in English is[91m the.[39m 
 What is there?


In [77]:
#Find all instances of the English article "the" without \b
txt= "The Cheetah is faster than the other animals. \nThe article in English is the. \n What is there?"
x= re.finditer("(^|[^a-zA-Z])[tT]he[^a-zA-Z]",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)
#Notice the . in red!

0
[91mThe [39mCheetah is faster than the other animals. 
The article in English is the. 
 What is there?
1
The Cheetah is faster than[91m the [39mother animals. 
The article in English is the. 
 What is there?
2
The Cheetah is faster than the other animals. [91m
The [39marticle in English is the. 
 What is there?
3
The Cheetah is faster than the other animals. 
The article in English is[91m the.[39m 
 What is there?


In [78]:
#Find all instances of the English article "the" without \b
txt= "The Cheetah is faster than$the$other animals. \nThe article in English is$the$\n What is there?"
x= re.finditer("(^|[^a-zA-Z])[tT]he[^a-zA-Z]",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
[91mThe [39mCheetah is faster than$the$other animals. 
The article in English is$the$
 What is there?
1
The Cheetah is faster than[91m$the$[39mother animals. 
The article in English is$the$
 What is there?
2
The Cheetah is faster than$the$other animals. [91m
The [39marticle in English is$the$
 What is there?
3
The Cheetah is faster than$the$other animals. 
The article in English is[91m$the$[39m
 What is there?


In [79]:
#Find all instances of the English article "the" without \b
txt= "The Cheetah is faster than the other animals. \nThe article in English is the"
x= re.finditer("(^|[^a-zA-Z])[tT]he[^a-zA-Z]",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)
#Is it finding the last "the"?

0
[91mThe [39mCheetah is faster than the other animals. 
The article in English is the
1
The Cheetah is faster than[91m the [39mother animals. 
The article in English is the
2
The Cheetah is faster than the other animals. [91m
The [39marticle in English is the


In [80]:
#Find all instances of the English article "the" without \b
txt= "The Cheetah is faster than the other animals. \nThe article in English is the"
x= re.finditer("(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
[91mThe [39mCheetah is faster than the other animals. 
The article in English is the
1
The Cheetah is faster than[91m the [39mother animals. 
The article in English is the
2
The Cheetah is faster than the other animals. [91m
The [39marticle in English is the
3
The Cheetah is faster than the other animals. 
The article in English is[91m the[39m
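
The patterns in this group also consume the character before and after the word, which is why spaces, newlines and punctuation appear in red. One way to get just the word, sketched here, is to capture it in its own group and make the surrounding groups non-capturing:

```python
import re

txt = "The Cheetah is faster than the other animals. \nThe article in English is the"

# Group 1 holds only the article; the context groups are non-capturing.
pattern = "(?:^|[^a-zA-Z])([tT]he)(?:[^a-zA-Z]|$)"
articles = [m.group(1) for m in re.finditer(pattern, txt)]

print(articles)
```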


In [81]:
#Help a user buy a computer as follows:
#any machine with at least 6 GHz and 500 GB Hard disk space for less than $1000
#We will look out for price patterns alone, say $199.99
txt= "Price of macbook is $25.34\nDell laptops at $199.99 is for sale!\nLenovo laptops starting from $150 in July, 2023."
x= re.finditer(r"\$[0-9]{1,3}\.[0-9][0-9]",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
Price of macbook is [91m$25.34[39m
Dell laptops at $199.99 is for sale!
Lenovo laptops starting from $150 in July, 2023.
1
Price of macbook is $25.34
Dell laptops at [91m$199.99[39m is for sale!
Lenovo laptops starting from $150 in July, 2023.


In [82]:
#Help a user buy a computer as follows:
#any machine with at least 6 GHz and 500 GB Hard disk space for less than $1000
#We will look out for price patterns alone, say $199.99
txt= "Price of macbook is $25.34\nDell laptops at $199.99 is for sale!\nLenovo laptops starting from $150 in July, 2023."
x= re.finditer(r"\$[0-9]{1,3}(\.[0-9][0-9])?",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)

0
Price of macbook is [91m$25.34[39m
Dell laptops at $199.99 is for sale!
Lenovo laptops starting from $150 in July, 2023.
1
Price of macbook is $25.34
Dell laptops at [91m$199.99[39m is for sale!
Lenovo laptops starting from $150 in July, 2023.
2
Price of macbook is $25.34
Dell laptops at $199.99 is for sale!
Lenovo laptops starting from [91m$150[39m in July, 2023.
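
To act on the "less than $1000" requirement, the matched text can be converted to a number and compared in Python. This is a sketch under the assumption that prices carry no thousands separators; unlike the pattern above, it uses `[0-9]+` so that longer amounts are read in full before the comparison.

```python
import re

txt = "Price of macbook is $25.34\nDell laptops at $199.99 is for sale!\nLenovo laptops starting from $150 in July, 2023."

# Capture the digits (and optional cents) after the dollar sign.
amounts = re.findall(r"\$([0-9]+(?:\.[0-9][0-9])?)", txt)
prices = [float(a) for a in amounts]

# Keep only the prices below the budget.
affordable = [p for p in prices if p < 1000]

print(prices)
print(affordable)
```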


In [83]:
#Help a user buy a computer as follows:
#any machine with at least 6 GHz and 500 GB Hard disk space for less than $1000
#We will look out for hard disk space of number followed by GB
txt= "Hard disk size of 225GB at $300\nHard disk size of 545    GB at $500 is for sale!\n"
x= re.finditer("\\b[1-9][] *GB",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)
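#The pattern above does not compile: a "]" placed right after "[" is read as a literal, so the set is never closed.
#But how do we ensure it is more than 500 GB??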

error: unterminated character set at position 7
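
A working variant, as a sketch: capture the digits that precede an optional run of spaces and `GB`, then compare the number in Python, since a numeric threshold such as "more than 500" is awkward to express inside the pattern itself. Using `[0-9]+` is an assumption about what the broken `[1-9][]` was meant to be.

```python
import re

txt = "Hard disk size of 225GB at $300\nHard disk size of 545    GB at $500 is for sale!\n"

# Digits, optional spaces, then GB as a whole word; group 1 is the number.
sizes = [int(n) for n in re.findall(r"\b([0-9]+) *GB\b", txt)]

# Apply the size threshold numerically.
large_enough = [n for n in sizes if n > 500]

print(sizes)
print(large_enough)
```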

In [84]:
#replace a pattern
txt= "Colour color colour"
x= re.finditer("color",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)
print(re.sub("colour","color",txt))
#Note: "Colour" (capital C) is left unchanged because the pattern is case-sensitive.

0
Colour [91mcolor[39m colour
Colour color color


In [85]:
#replace a pattern
txt= "123 is before 456\n456 is before 567"
x= re.finditer("[0-9]+",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)
y= re.sub("([0-9]+)",r"<\1>",txt)
print(f'{y}')

0
[91m123[39m is before 456
456 is before 567
1
123 is before [91m456[39m
456 is before 567
2
123 is before 456
[91m456[39m is before 567
3
123 is before 456
456 is before [91m567[39m
<123> is before <456>
<456> is before <567>
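
`re.sub()` also accepts a function as the replacement: it is called with each match object and returns the replacement text. A sketch that doubles every number (the helper name `double` is made up for illustration):

```python
import re

txt = "123 is before 456\n456 is before 567"

def double(m):
    # m.group() is the matched digit string; return its doubled value as text.
    return str(int(m.group()) * 2)

doubled = re.sub("[0-9]+", double, txt)
print(doubled)
```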


In [86]:
#replace a pattern
txt= "The bigger they were, the bigger they will be\nThe bigger they were, the larger they will be"
x= re.finditer("The (.*)er they were, the \\1er they will be",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)
y= re.sub("The (.*)er they were, the \\1er they will be","<\\1>",txt)
print(f'{y}')

0
[91mThe bigger they were, the bigger they will be[39m
The bigger they were, the larger they will be
<bigg>
The bigger they were, the larger they will be


In [87]:
#replace a pattern
txt= "The faster they ran, the faster we ran The faster they ran, the faster we ran"
x= re.finditer("The (.*)er they .* ran, the \\1er we ran",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)
y= re.sub("The (.*)er they (.*), the \\1er we \\2","<\\1><\\2>",txt)
print(f'{y}')

0
[91mThe faster they ran, the faster we ran The faster they ran, the faster we ran[39m
<fast><ran> <fast><ran>


In [88]:
#replace a pattern
txt= "Some cats like some cats\nSome people like some people\nA few cats like some people"
x= re.finditer("(?:[sS]ome|[aA] few) (people|cats) like some \\1",txt)
for i,m in enumerate(x):
    print(i)
    printDetails(m)
y= re.sub("(?:[sS]ome|[aA] few) (people|cats) like some \\1","<\\1>",txt)
print(f'{y}')

0
[91mSome cats like some cats[39m
Some people like some people
A few cats like some people
1
Some cats like some cats
[91mSome people like some people[39m
A few cats like some people
<cats>
<people>
A few cats like some people
