<h1>Regular Expressions </h1>

<h2> Simple String Preprocessing Functions </h2>

In [1]:
text = "sgtEEEr2020.0"

In [2]:
# Strings have methods for checking "global" string properties
print("1.", text.isalpha())

1. False


In [3]:
# These can also be applied per character
print("2.", [c.isalpha() for c in text])
print(text.isdigit(), "-->", [c.isdigit() for c in text])
print(text.isspace(), "-->", [c.isspace() for c in text])
print(text.islower(), "-->", [c.islower() for c in text])
print(text.isupper(), "-->", [c.isupper() for c in text])

2. [True, True, True, True, True, True, True, False, False, False, False, False, False]
False --> [False, False, False, False, False, False, False, True, True, True, True, False, True]
False --> [False, False, False, False, False, False, False, False, False, False, False, False, False]
False --> [True, True, True, False, False, False, True, False, False, False, False, False, False]
False --> [False, False, False, True, True, True, False, False, False, False, False, False, False]


<p> <span style="color:purple"> Exercise: use a list comprehension to check `c.isnumeric()` for each character (`c` stands for char). </span></p>

In [4]:
print(text.isnumeric(), "-->", [c.isnumeric() for c in text])

False --> [False, False, False, False, False, False, False, True, True, True, True, False, True]
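For the string above, `isdigit()` and `isnumeric()` give identical results, which can hide the fact that they are different checks. The sketch below uses three sample characters (chosen here for illustration, not taken from the notebook) to show where `isdecimal()`, `isdigit()` and `isnumeric()` disagree.

```python
# Sample characters chosen to show where the three checks disagree:
# "7" (ordinary digit), "\u00b2" (superscript two), "\u00bd" (fraction one half).
samples = ["7", "\u00b2", "\u00bd"]

# For each character, record (isdecimal, isdigit, isnumeric)
checks = {ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric()) for ch in samples}

for ch, result in checks.items():
    print(repr(ch), result)
```

Roughly speaking, `isdecimal()` is the strictest of the three and `isnumeric()` the most permissive.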


In [5]:
# Check whether s looks like a Social Security number of the form XXX-XX-XXXX
def is_ssn(s):
    parts = s.split('-')
    correct_lengths = [3, 2, 4]
    # There must be exactly three dash-separated parts
    if len(parts) != len(correct_lengths):
        return False
    # Each part must consist of digits only and have the expected length
    for p, n in zip(parts, correct_lengths):
        if not (p.isdigit() and len(p) == n):
            return False
    return True


<h2> Regular Expressions - Setup and Basics </h2>

https://docs.python.org/3/howto/regex.html

<a href="http://www.cbs.dtu.dk/courses/27610/regular-expressions-cheat-sheet-v2.pdf" > cheat sheet </a>

<p> Regular expressions - basic steps
    <ol>
        <li> Set a <b>pattern</b> to find </li>
        <li> <b>Compile</b> it into a <i>pattern object</i> </li>
        <li> Apply the pattern object to a (target) string to <b>find</b> <i>matches</i> (instances of the pattern within the string) </li>
</ol>
    

`targetString = "this is a target text for the pattern_string"`
1. `pattern = 'pattern_string'`
2. `pattern_obj = re.compile(pattern)`
3. `matches = pattern_obj.search(targetString)`, using `search` or one of the other <b>search functions</b><br>
You may want to print the result:
`print(matches)`


3.a A match object can be queried and printed in several ways: <br>
`print(matches.group())` : the matched text <br>
`print(matches.start())` : the start index <br>
`print(matches.end())` : the end index (exclusive) <br>
`print(matches.span())` : the tuple `(start, end)`

3.b The <b>matcher object</b> (or pattern object) has several <b>search functions</b>:
<br><ul>
    <li>`.match(targetString)` - determines whether the regex matches at the <b>beginning</b> of the string</li>
    <li>`.search(targetString)` - scans through the string, looking for <b>any</b> location where the regex matches</li>
    <li>`.findall(targetString)` - finds <b>all</b> substrings where the regex matches and returns them as a <b>list</b></li>
    <li>`.finditer(targetString)` - finds <b>all</b> substrings where the regex matches and returns them as an <b>iterator</b> of match objects</li>
    </ul>
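The match-object queries listed above can be sketched as follows. This is a minimal illustration; the target string and the pattern `"target"` are chosen here for the example and are not part of the notebook's cells.

```python
import re

# Hypothetical target string and pattern, chosen only for illustration.
target = "this is a target text for the pattern_string"
pattern_obj = re.compile("target")

m = pattern_obj.search(target)
if m is not None:              # search() returns None when nothing matches
    print(m.group())           # the matched text
    print(m.start(), m.end())  # start index and (exclusive) end index
    print(m.span())            # the same two indices as a tuple
    # Slicing the target with the span gives back the matched text
    print(target[m.start():m.end()] == m.group())

# match() only succeeds at the beginning of the string, so this prints None
print(pattern_obj.match(target))
```

Checking for `None` before calling `.group()` avoids an `AttributeError` when the pattern is absent.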

In [6]:
import re

In [7]:
# 1. target string and pattern
text = "this is a target string text that has a lot of text of a few words"
pattern = "few"
# 2. compile the pattern into a pattern object
pattern_obj = re.compile(pattern)
# 3. apply the pattern object to the target string
matches = pattern_obj.search(text)
print("Checking pattern presence with .search()")
print(matches)
print("Checking if found at the beginning with .match()")
matchesMatch = pattern_obj.match(text)

print(matchesMatch)
print("extracting group from matches")
print(matches.group())
print(type(matches))
print("using the .findall() function")
matchesAll = pattern_obj.findall(text)
print(matchesAll)
print("now with pattern 2 'text'")
pattern2 = "text"
pattern_obj2 = re.compile(pattern2)
matches2All = pattern_obj2.findall(text)
print(matches2All)
print("iterate with finditer() to extract span()")
matches2All = pattern_obj2.finditer(text)
print([w.span() for w in matches2All])

Checking pattern presence with .search()
<_sre.SRE_Match object; span=(57, 60), match='few'>
Checking if found at the beginning with .match()
None
extracting group from matches
few
<class '_sre.SRE_Match'>
using the .findall() function
['few']
now with pattern 2 'text'
['text', 'text']
iterate with finditer() to extract span()
[(24, 28), (47, 51)]


<h2> Special Characters and Patterns </h2>

In [12]:
pattern = r"Cookie"
sequence = "Cokie"
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match!")

Not a match!


<h3> `[ char seq/list]` character classes </h3>

In [13]:
text = "the great text seq"
p = '[gh]'
pobj = re.compile(p)
print("pobj is an instance of the {} type".format(type(pobj)))
print('with .findall()')
matches = pobj.findall(text)

print(matches)

print('with finditer()')
matches = pobj.finditer(text)
print([c.start() for c in matches])

pobj is an instance of the <class '_sre.SRE_Pattern'> type
with .findall()
['h', 'g']
with finditer()
[1, 4]


In [12]:
text[5]

'r'
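A character class can also hold a range instead of an explicit list of characters. The sketch below reuses the same target string; the range `[a-e]` is an illustrative choice, not something the cells above use.

```python
import re

text = "the great text seq"

# Characters listed one by one, as in the cell above
listed = re.findall("[gh]", text)
print(listed)

# A range inside the brackets: any single character from 'a' to 'e'
in_range = re.findall("[a-e]", text)
print(in_range)
```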

<h3> `.` any character except a newline (`\n`) </h3>

In [10]:
text = "the great text seq"
p = '.'
pobj = re.compile(p)
matches = pobj.findall(text)
print(matches)

['t', 'h', 'e', ' ', 'g', 'r', 'e', 'a', 't', ' ', 't', 'e', 'x', 't', ' ', 's', 'e', 'q']


<h3> `^` caret inside `[]`: the complement set </h3>

In [11]:
text2 = "num43"
p = '[^num]'
pobj = re.compile(p)
matches = pobj.findall(text2)
print(matches)
print('count matches with len(): {:}'.format(len(matches)))


['4', '3']
count matches with len(): 2


<h3> `\` backslash as escape char</h3>

In [15]:
text3 = "num[43]"
p = r'[\[\]]'  # raw string: the backslashes reach the regex engine unchanged
pobj = re.compile(p)
matches = pobj.findall(text3)
print(matches)
print('count matches with len(): {:}'.format(len(matches)))


['[', ']']
count matches with len(): 2
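Because the backslash is also Python's own escape character in ordinary string literals, regex patterns are usually written as raw strings (`r'...'`). The following is a small sketch of the difference; the `path` value is a made-up example.

```python
import re

# In a regular string "\n" is one character (a newline);
# in a raw string r"\n" it is two: a backslash and the letter n.
print(len("\n"), len(r"\n"))

# To match a literal backslash, the regex itself needs an escaped backslash.
path = "C:\\temp"                      # hypothetical value: the text C:\temp
backslashes = re.findall(r"\\", path)
print(backslashes)

# Escaped brackets inside a character class are matched as ordinary characters.
brackets = re.findall(r"[\[\]]", "num[43]")
print(brackets)
```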


In [17]:
text2="num43"

<h3> special sequences with backslash </h3>

`\d`=`[0-9]`

In [18]:
p = r'\d'
pobj=re.compile(p)
matches = pobj.findall(text2)
print(matches)

['4', '3']


`\D`=`[^0-9]`

In [19]:
p = r'\D'
pobj=re.compile(p)
text2='num43'
print(text2)
matches = pobj.findall(text2)
print(matches)

num43
['n', 'u', 'm']


`\s` = whitespace (spaces, tabs, newlines)

In [20]:
p = r'\s'
pobj=re.compile(p)
text4='num43 num44 \nnum45'
print(text4)
matches = pobj.findall(text4)
print(matches)

num43 num44 
num45
[' ', ' ', '\n']


`\S` = non-whitespace

In [21]:
text4='num43 num44 \nnum45'
p = r'\S'
pobj=re.compile(p)

print(text4)
matches = pobj.findall(text4)
print(matches)

num43 num44 
num45
['n', 'u', 'm', '4', '3', 'n', 'u', 'm', '4', '4', 'n', 'u', 'm', '4', '5']


`\w`=`[a-zA-Z0-9_]` <br>
`\W`=`[^a-zA-Z0-9_]`

In [22]:
p = r'\W'
p2 = r'\w'
pobj=re.compile(p)
pobj2=re.compile(p2)

matches = pobj.findall(text4)
matches2 = pobj2.findall(text4)
print(matches)
print(matches2)

[' ', ' ', '\n']
['n', 'u', 'm', '4', '3', 'n', 'u', 'm', '4', '4', 'n', 'u', 'm', '4', '5']
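The target string above contains no underscore or punctuation, so it does not show which side of the `\w` / `\W` split those characters fall on. A minimal sketch with a made-up sample string:

```python
import re

sample = "snake_case-name 42"   # hypothetical string for illustration

word_chars = re.findall(r"\w", sample)       # letters, digits and underscore
non_word_chars = re.findall(r"\W", sample)   # everything else
print(word_chars)
print(non_word_chars)
```

The underscore is counted as a word character, while the hyphen and the space are not.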


Combined pattern with `\s`

In [18]:
pc = r'.\s.'
pobj3= re.compile(pc)
matches2 = pobj3.findall(text4)
print(matches2)

['3 n', ' \nn']


<h3> `^` caret: pattern at the beginning of the string (or of each line with `re.MULTILINE`) </h3>

In [31]:
print(text4)
p4= '^nu*'
pobj3= re.compile(p4)
matches2 = pobj3.findall(text4)
print("\n" , matches2)

num43 num44 
num45

 ['nu']
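By default `^` anchors only at the very start of the string, even when the string contains newlines. A short sketch of how the `re.MULTILINE` flag changes this; the pattern `^num\d\d` is an illustrative choice.

```python
import re

lines = "num43 num44 \nnum45"

# Default mode: ^ matches only at the very start of the string
default_matches = re.findall(r"^num\d\d", lines)

# re.MULTILINE: ^ also matches right after each newline
multiline_matches = re.findall(r"^num\d\d", lines, flags=re.MULTILINE)

print(default_matches)
print(multiline_matches)
```

Note that `num44` is never reported: it is neither at the start of the string nor at the start of a line.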


<h3> `$` pattern at the end of the string </h3>

In [20]:
print(text4)
p4= '5$'
pobj3= re.compile(p4)
matches2 = pobj3.findall(text4)
print(matches2)

num43 num44 
num45
['5']


<h3> `*`  previous char 0 or more times</h3>

In [23]:
text4 = 'num43 num44 \nnum45'
print(text4)
p4 = '44*'
pobj3 = re.compile(p4)
matches2 = pobj3.findall(text4)
print(matches2)

num43 num44 
num45
['4', '44', '4']

<h3> `?` previous char zero or one time </h3>

In [22]:

text4 = 'num43 num44 \nnum45'
p4 = '4?4'
pobj3 = re.compile(p4)
matches2 = pobj3.findall(text4)
print(matches2)

['4', '44', '4']

In [18]:
tel="0544242244"

<p><span style="color:purple" > Exercise: define a pattern object that matches one or two "4"s, apply it to `tel` and print the `.findall()` list; then create another match instance with `.search()` and print its span with `.span()`. </span></p>

In [30]:
ptrnobj=re.compile("4?4")
matches4s=ptrnobj.findall(tel)
matches4search=ptrnobj.search(tel)

print(matches4s)
print(matches4search.span())

['44', '4', '44']
(2, 4)


<h3> `+` previous char one or more times </h3>

In [23]:
p4= '44+'
pobj3= re.compile(p4)
matches2 = pobj3.findall(text4)
print(matches2)

['44']
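To compare `*`, `?` and `+` side by side, the sketch below applies all three to one made-up string containing runs of one, two and three `4`s.

```python
import re

sample = "4 44 444 5"   # hypothetical string for illustration

star = re.findall(r"44*", sample)       # a '4' followed by zero or more '4'
optional = re.findall(r"4?4", sample)   # an optional '4', then a '4'
plus = re.findall(r"44+", sample)       # a '4' followed by one or more '4'

print(star)
print(optional)
print(plus)
```

The quantifiers are greedy, so `44*` and `44+` consume a whole run of `4`s, while `4?4` takes at most two characters per match and therefore splits the run `444` into two matches.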


<h3> `{num}` other quantifiers: number of repetitions </h3>

<h4> `{num}` exactly num times </h4>

In [32]:
text5='num43 num444 \nnum445'
p4 = '4{2}'
pobj3 = re.compile(p4)
matches3 = pobj3.findall(text5)
print(matches3)

['44', '44']


<h4> `{num,}` num or more times </h4>

In [34]:
text5='num43 num444 \nnum445'
p3 = '4{2,}'
pobj4= re.compile(p3)
matches4= pobj4.findall(text5)
print(matches4)

['444', '44']
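The exact-count quantifier gives a compact way to express the XXX-XX-XXXX check that `is_ssn` performed earlier with string methods. The following is a sketch under assumptions, not a drop-in replacement: the names `ssn_pattern` and `is_ssn_regex` are hypothetical, and `[0-9]` accepts ASCII digits only, whereas `str.isdigit()` also accepts some other Unicode digit characters.

```python
import re

# Hypothetical regex counterpart of is_ssn: 3 digits, 2 digits, 4 digits.
ssn_pattern = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

def is_ssn_regex(s):
    """Return True if the whole string has the form XXX-XX-XXXX."""
    # fullmatch() requires the pattern to cover the entire string
    return ssn_pattern.fullmatch(s) is not None

print(is_ssn_regex("123-45-6789"))
print(is_ssn_regex("123-456-789"))
print(is_ssn_regex("x123-45-6789"))
```

Using `fullmatch()` rather than `search()` matters here: `search()` would also accept strings that merely contain a valid-looking number somewhere inside them.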
