<h1>Regular Expressions</h1>
<li>We already know we can search for substrings within a larger string with the in operator</li>
<li>> "dog" in "my dog is great"</li>
<li>>> True</li>
<li>This has severe limitations: we need to know the exact string, and we need additional operations to account for capitalization and punctuation</li>
<li>What if we only know the pattern structure of the string we're looking for, like an email or phone number?</li>
<li>Regular Expressions (regex) allow us to search for general patterns in text data!</li>
<li>For example, a simple email format can be:</li>
<li>> user@email.com</li>
<li>We know in this case we're looking for a pattern "text" + "@" + "text" + ".com"</li>
<li>The <strong>re</strong> library allows us to create specialized pattern strings and then search for matches within text</li>
<li>The primary skill set for regex is understanding the special syntax for these pattern strings</li>
<li>Don't feel like you need to memorize these patterns! Focus on understanding how to look up the information</li>
<li>Phone Number</li>
<li>> (555)-555-5555</li>
<li>Regex Pattern</li>
<li>> r"\(\d\d\d\)-\d\d\d-\d\d\d\d"</li>
<li>> r"\(\d{3}\)-\d{3}-\d{4}"</li>
<li>This series of lectures will first focus on how to use the <strong>re</strong> library to search for patterns within text</li>
<li>Afterwards we will focus on understanding the regex syntax codes</li>
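
The points above can be sketched in a few lines. This is a minimal illustration, assuming a deliberately simplified email pattern (word characters, "@", word characters, ".com"); real addresses are more varied. The sample sentence and variable names are made up for the example.

```python
import re

sentence = "My Dog is great. Write to user@email.com or call (555)-555-5555."

# The in operator is an exact, case-sensitive substring test.
exact_hit = "dog" in sentence            # False: the text contains "Dog"
lowered_hit = "dog" in sentence.lower()  # True, after an extra normalization step

# Simplified email pattern: text + "@" + text + ".com"
email_pattern = r"\w+@\w+\.com"
email = re.search(email_pattern, sentence).group()

# Parentheses are grouping syntax in regex, so literal ones are escaped.
phone_pattern = r"\(\d{3}\)-\d{3}-\d{4}"
phone = re.search(phone_pattern, sentence).group()

print(exact_hit, lowered_hit)  # False True
print(email)                   # user@email.com
print(phone)                   # (555)-555-5555
```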

In [2]:
text = "The agent's phone number is 408-555-1234. Call soon!"

In [3]:
'phone' in text

True

In [4]:
import re

In [5]:
pattern = 'phone'

In [6]:
re.search(pattern, text)

<re.Match object; span=(12, 17), match='phone'>

In [7]:
pattern = 'NOT IN TEXT'

In [8]:
re.search(pattern, text)
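
The cell above prints nothing because re.search returns None when the pattern is not found. A minimal sketch of guarding against that case follows; the helper name describe is hypothetical and exists only for this example.

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon!"

def describe(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:
        # Calling match.group() here would raise AttributeError.
        return "no match"
    return f"{match.group()!r} at {match.span()}"

print(describe("phone", text))        # 'phone' at (12, 17)
print(describe("NOT IN TEXT", text))  # no match
```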

In [9]:
pattern = 'phone'

In [10]:
match = re.search(pattern, text)

In [11]:
match

<re.Match object; span=(12, 17), match='phone'>

In [12]:
match.span()

(12, 17)

In [13]:
match.start()

12

In [14]:
match.end()

17

In [15]:
text = 'my phone once, my phone twice'

In [16]:
match = re.search('phone', text)

In [17]:
match

<re.Match object; span=(3, 8), match='phone'>

In [18]:
matches = re.findall('phone', text)

In [19]:
matches

['phone', 'phone']

In [20]:
len(matches)

2

In [21]:
for match in re.finditer('phone', text):
    print(match)

<re.Match object; span=(3, 8), match='phone'>
<re.Match object; span=(18, 23), match='phone'>


In [22]:
for match in re.finditer('phone', text):
    print(match.span())

(3, 8)
(18, 23)


In [23]:
for match in re.finditer('phone', text):
    print(match.group())

phone
phone
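
The three loops above can be combined. As a small sketch, the match text and its span can be collected together in one pass, and each span can be used to slice the original string; the variable name found is just for illustration.

```python
import re

text = 'my phone once, my phone twice'

# Each item pairs the matched text with its (start, end) span.
found = [(m.group(), m.span()) for m in re.finditer('phone', text)]
print(found)  # [('phone', (3, 8)), ('phone', (18, 23))]

# Slicing the text with a span recovers the matched substring.
for matched, (start, end) in found:
    print(text[start:end] == matched)  # True
```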


<h2>Character Identifiers</h2>
<li>\d - A digit</li>
<li>\w - Alphanumeric (letters, digits, and the underscore)</li>
<li>\s - White space</li>
<li>\D - A non digit</li>
<li>\W - Non-alphanumeric (anything \w does not match)</li>
<li>\S - Non-whitespace</li>
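
To see each identifier in action, here is a short sketch that runs all six against one made-up sample string. The sample text is an assumption chosen to include digits, an underscore, spaces, and punctuation.

```python
import re

sample = "Room 42_b, floor 3!"

digits = re.findall(r"\d", sample)       # ['4', '2', '3']
words = re.findall(r"\w+", sample)       # ['Room', '42_b', 'floor', '3']
spaces = re.findall(r"\s", sample)       # [' ', ' ', ' ']
non_digits = re.findall(r"\D+", sample)  # ['Room ', '_b, floor ', '!']
non_words = re.findall(r"\W", sample)    # [' ', ',', ' ', ' ', '!']
non_spaces = re.findall(r"\S+", sample)  # ['Room', '42_b,', 'floor', '3!']

print(digits, words, spaces, non_digits, non_words, non_spaces, sep="\n")
```

Note that the underscore in 42_b counts as a \w character, while the comma and exclamation mark do not.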

In [24]:
text = 'My phone number is 408-555-1234'

In [25]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)

In [26]:
phone

<re.Match object; span=(19, 31), match='408-555-1234'>

In [27]:
phone.group()

'408-555-1234'

<h2>Quantifiers</h2>
<li>+ - Occurs one or more times</li>
<li>{3} - Occurs exactly 3 times</li>
<li>{2,4} - Occurs 2 to 4 times</li>
<li>{3,} - Occurs 3 or more times</li>
<li>* - Occurs zero or more times</li>
<li>? - Occurs zero or one time</li>
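
The quantifiers listed above can be compared side by side. The sketch below uses re.fullmatch so that a pattern must account for the entire candidate string; the helper full_matches and the candidate list are hypothetical, chosen for illustration. The counted forms are written without spaces inside the braces.

```python
import re

candidates = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]

def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s) is not None]

one_or_more = full_matches(r"a+")      # ['a', 'aa', 'aaa', 'aaaa', 'aaaaa']
exactly_three = full_matches(r"a{3}")  # ['aaa']
two_to_four = full_matches(r"a{2,4}")  # ['aa', 'aaa', 'aaaa']
three_plus = full_matches(r"a{3,}")    # ['aaa', 'aaaa', 'aaaaa']
zero_or_more = full_matches(r"a*")     # every candidate, including ''
zero_or_one = full_matches(r"a?")      # ['', 'a']

print(one_or_more, exactly_three, two_to_four, sep="\n")
print(three_plus, zero_or_more, zero_or_one, sep="\n")
```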

In [28]:
phone = re.search(r'\d{3}-\d{3}-\d{4}', text)

In [29]:
phone

<re.Match object; span=(19, 31), match='408-555-1234'>

In [30]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [31]:
results = phone_pattern.search(text)

In [32]:
results.group()

'408-555-1234'

In [33]:
results.group(3)

'1234'
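
The parentheses in the compiled pattern create numbered groups. A short sketch of the different ways to read them, using the same text and pattern as above; the names whole, area_code, and parts are illustrative.

```python
import re

text = 'My phone number is 408-555-1234'
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# A compiled pattern has its own search method.
results = phone_pattern.search(text)

whole = results.group()       # same as group(0): the entire match
area_code = results.group(1)  # groups are numbered starting at 1
parts = results.groups()      # all groups as a tuple

print(whole)      # 408-555-1234
print(area_code)  # 408
print(parts)      # ('408', '555', '1234')
```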

<h1>Additional Regex Syntax</h1>

In [35]:
re.search(r'cat|dog', 'The dog is here')

<re.Match object; span=(4, 7), match='dog'>

In [36]:
re.findall(r'at', 'The cat in the hat sat there.')

['at', 'at', 'at']

In [37]:
re.findall(r'...at', 'The cat in the hat sat there.')

['e cat', 'e hat']

In [38]:
re.findall(r'^\d', '1 is a number')

['1']

In [40]:
re.findall(r'\d$', 'The number is 2')

['2']
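
The ^ and $ anchors tie a match to the start or end of the string. As a quick sketch, swapping the two example strings shows that the same digit patterns find nothing when the digit sits at the other end.

```python
import re

starts_with_digit = re.findall(r'^\d', '1 is a number')     # ['1']
digit_not_at_start = re.findall(r'^\d', 'The number is 2')  # []
ends_with_digit = re.findall(r'\d$', 'The number is 2')     # ['2']
digit_not_at_end = re.findall(r'\d$', '1 is a number')      # []

print(starts_with_digit, digit_not_at_start)
print(ends_with_digit, digit_not_at_end)
```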

In [41]:
phrase = 'there are 3 numbers 34 inside 5 this sentence'

In [44]:
pattern = r'[^\d]+'

In [45]:
re.findall(pattern, phrase)

['there are ', ' numbers ', ' inside ', ' this sentence']

In [46]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [51]:
clean = re.findall(r'[^!.? ]+', test_phrase)

In [53]:
clean

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [54]:
' '.join(clean)

'This is a string But it has punctuation How can we remove it'
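
An alternative to the findall-then-join approach is re.sub, which replaces every match with a given string. This is a sketch of that swapped-in technique, not something the cells above use; for this particular phrase it produces the same result.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each listed punctuation mark with an empty string.
no_punct = re.sub(r'[!.?]', '', test_phrase)
print(no_punct)
# This is a string But it has punctuation How can we remove it

# Same result as the findall + join approach for this phrase.
joined = ' '.join(re.findall(r'[^!.? ]+', test_phrase))
print(no_punct == joined)  # True
```

The two approaches can differ on other inputs: join collapses runs of spaces into one, while re.sub leaves the original spacing untouched.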

In [57]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [62]:
pattern = r'[\w]+-[\w]+'

In [63]:
re.findall(pattern, text)

['hyphen-words', 'long-ish']

In [64]:
text = 'Hello would you like some catfish?'
texttwo = 'Hello, would you like to take a catnap?'
textthree = 'Hello, have you seen this caterpillar?'

In [67]:
re.search(r'cat(fish|nap|claw)', textthree)
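
The last cell returns None because "caterpillar" does not continue with any of the listed endings. A short sketch that runs the same pattern over all three strings makes the contrast visible; the results list is an illustrative name.

```python
import re

text = 'Hello would you like some catfish?'
texttwo = 'Hello, would you like to take a catnap?'
textthree = 'Hello, have you seen this caterpillar?'

# "cat" followed by exactly one of the alternatives inside the group.
pattern = r'cat(fish|nap|claw)'

results = []
for sample in (text, texttwo, textthree):
    match = re.search(pattern, sample)
    results.append(match.group() if match else None)

print(results)  # ['catfish', 'catnap', None]
```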