# Regular Expressions
## Examples

### Example 1: Developing a regular expression step by step
Start with a simple expression and add more options until you find what you're looking for.

In [84]:
import re
s = "The theme is another cat in the hat"

# I want to find all the occurrences of the word "the".
# I use [tT] to match both upper case and lower case.
re.findall(r"[tT]he", s)

['The', 'the', 'the', 'the']

There are too many results here. re.findall() finds the two occurrences of "the" I was looking for, but it also finds "the" inside **the**me and ano**the**r. I add the word-boundary anchor **\b** before and after to match "the" only as a whole word.

In [85]:
re.findall(r"\b[tT]he\b", s)

['The', 'the']

This works. I find both occurrences of "the" in my string.  
What if I want to know which words contain the substring "the"? I add one or more non-whitespace characters, **\S+**, on both sides of my expression to search for words containing "the".

In [86]:
re.findall(r"\S+[tT]he\S+", s)

['another']

I find "another" but not "theme". Why? Because I used \S+ on both sides, which requires at least one non-whitespace character on each side, but "theme" has extra characters only on the right. I can replace + with the zero-or-more quantifier **\*** on both sides.

In [87]:
re.findall(r"\S*[tT]he\S*", s)

['The', 'theme', 'another', 'the']

This was too much! Because the non-whitespace characters are now optional on both sides, findall() also returns the standalone "The" and "the". I should require at least one extra character on one side at a time: first on the left, then on the right.

In [88]:
print(re.findall(r"\S+[tT]he\S*", s))
print(re.findall(r"\S*[tT]he\S+", s))

['another']
['theme', 'another']


This is great, but I used two regular expressions. I use the disjunction (alternation) operator | to tell re.findall() to search for either expression in one shot.

In [89]:
re.findall(r"\S+[tT]he\S*|\S*[tT]he\S+", s)


['theme', 'another']

That's it!
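As a cross-check on the final expression, the same question can also be approached by splitting the string into words first and then filtering them. This is only a sketch of an alternative, not a replacement for the steps above; the names `words` and `containing` are made up for illustration.

```python
import re

s = "The theme is another cat in the hat"

# Split the string into runs of non-whitespace characters (the "words").
words = re.findall(r"\S+", s)

# Keep the words that contain "the"/"The" but are not exactly "the"/"The".
containing = [
    w for w in words
    if re.search(r"[tT]he", w) and not re.fullmatch(r"[tT]he", w)
]
print(containing)
```

The list should agree with the result of the single expression with the | operator.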

## Useful examples of regular expressions

### Example 1: find all the prices in the text.

In [90]:
import re
text = "Roku players start at $29.99 up to $79.9, and you enjoy 2 months of Philo, a value of $40."

In [91]:
# The expression consists of two alternatives separated by |:
# The first looks for the dollar sign \$, followed by one or more digits [0-9]+,
#    a decimal point \., and one or more digits [0-9]+
# The second looks for prices without decimals (dollar sign followed by one or more digits) \$[0-9]+
print(re.findall(r"\$[0-9]+\.[0-9]+|\$[0-9]+", text))

['$29.99', '$79.9', '$40']


In [92]:
# Note that if I require exactly two decimals with \.[0-9][0-9], the first alternative no longer matches $79.9,
# so the second alternative returns the truncated $79.
print(re.findall(r"\$[0-9]+\.[0-9][0-9]|\$[0-9]+", text))

['$29.99', '$79', '$40']
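The two alternatives can also be folded into a single expression by making the decimal part optional with a non-capturing group, `(?:...)?`. The sketch below assumes the same `text` as above; `price_pattern`, `prices`, and `amounts` are illustrative names.

```python
import re

text = "Roku players start at $29.99 up to $79.9, and you enjoy 2 months of Philo, a value of $40."

# \$[0-9]+ matches the dollar sign and the whole-dollar part;
# (?:\.[0-9]+)? optionally matches a decimal point followed by one or more digits.
price_pattern = re.compile(r"\$[0-9]+(?:\.[0-9]+)?")
prices = price_pattern.findall(text)
print(prices)

# Strip the dollar sign to work with the amounts as numbers.
amounts = [float(p[1:]) for p in prices]
print(amounts)
```

The group is non-capturing on purpose: with a capturing group, findall() would return only the group contents instead of the whole price.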


### Example 2: find all email addresses

In [93]:
text = "gianluca.zanella@utsa.edu;@classtopic:data foundations;here-again@anotheremail.biz."

In [94]:
# First version (simple and will work for this class)
# In some cases it doesn't work very well. Notice that the second result is not a valid address (it ends with a dot).
re.findall(r'[\w.-]+@[\w.-]+', text)

['gianluca.zanella@utsa.edu', 'here-again@anotheremail.biz.']

In [95]:
# Second version that covers most common cases and leaves out the trailing dot
re.findall(r"\b[\w.!#$%&'*+/=?^`{|}~-]+@[\w-]+(?:\.[\w-]+)*\b", text)

['gianluca.zanella@utsa.edu', 'here-again@anotheremail.biz']
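A middle ground between the two versions is to keep the simple local part but describe the domain as dot-separated labels, so that a dot only matches when more characters follow it. This is a sketch under that assumption, not a complete email validator; `email_pattern` and `emails` are illustrative names.

```python
import re

text = "gianluca.zanella@utsa.edu;@classtopic:data foundations;here-again@anotheremail.biz."

# Local part: word characters, dots, and hyphens.
# Domain: one label, then one or more ".label" groups,
# so a dot is only matched when it is followed by more characters.
email_pattern = re.compile(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+")
emails = email_pattern.findall(text)
print(emails)
```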

### Example 3: website URLs

In [2]:
import re
text = "Check more on the world wide web (www) at https://developers.google.com/edu/python/regular-expressions and http://www.debuggex.com/cheatsheet/regex/python"

In [97]:
# First version (simple and will work for this class)
# It accepts http or https, followed by word characters, dots, slashes, and hyphens.
re.findall(r"https?://[\w./-]+", text)

['https://developers.google.com/edu/python/regular-expressions',
 'http://www.debuggex.com/cheatsheet/regex/python']

In [98]:
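# Second version that covers more cases (for example, percent-encoded characters such as %20)
# Note: inside [$-_@.&+], $-_ is a character range (from $ to _), which is why / and - are matched.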
re.findall(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", text)

['https://developers.google.com/edu/python/regular-expressions',
 'http://www.debuggex.com/cheatsheet/regex/python']

In [3]:
# Third version: A simple expression for URLs with WWW
re.findall(r"www\.[a-z]+\.[a-z]+", text)

['www.debuggex.com']

In [99]:
# Another version, using a group of alternative schemes followed by non-whitespace characters.
# Online you can find plenty of expressions for URLs.
re.findall(r"(?:https?|ftp|file)://\S+", text)

['https://developers.google.com/edu/python/regular-expressions',
 'http://www.debuggex.com/cheatsheet/regex/python']
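Once candidate URLs have been extracted, the standard library's `urllib.parse.urlparse` can split them into parts, which is often easier than encoding every rule in the regular expression. The sketch below uses a deliberately loose pattern (`https?://\S+`) as an assumption; `candidates` and `hosts` are illustrative names.

```python
import re
from urllib.parse import urlparse

text = "Check more on the world wide web (www) at https://developers.google.com/edu/python/regular-expressions and http://www.debuggex.com/cheatsheet/regex/python"

# Loose extraction: a scheme followed by any run of non-whitespace characters.
candidates = re.findall(r"https?://\S+", text)

# Let urlparse split each candidate into scheme, host, and path.
for url in candidates:
    parts = urlparse(url)
    print(parts.scheme, parts.netloc, parts.path)

hosts = [urlparse(url).netloc for url in candidates]
print(hosts)
```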

### Example 4: match the entire string

In [100]:
# Sometimes the regular expression returns only a portion of what we are looking for.
# For example, I am looking for prices in this string. The expression asks for one decimal.
# The first price is truncated ($29.9 instead of $29.99)
import re
text = "Roku players start at $29.99 up to $79.9."
print(re.findall(r"\$[0-9]+\.[0-9]", text))

['$29.9', '$79.9']


In [101]:
# To match the entire price, I need to search for one or more decimal digits (with the +)
print(re.findall(r"\$[0-9]+\.[0-9]+", text))

['$29.99', '$79.9']
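One way to notice this kind of truncation is to look at the character that immediately follows each match. The sketch below uses `re.finditer()`, which returns match objects with their positions, together with the one-decimal expression; a digit right after a match suggests the price was cut short. The names `pattern` and `truncated` are illustrative.

```python
import re

text = "Roku players start at $29.99 up to $79.9."
pattern = re.compile(r"\$[0-9]+\.[0-9]")

truncated = []
for m in pattern.finditer(text):
    # Character right after the match (empty string at the end of the text).
    next_char = text[m.end():m.end() + 1]
    if next_char.isdigit():
        truncated.append(m.group())

print(truncated)
```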


In [102]:
# Another example of matching the entire string
import re
# I have several sequences of chromosomes X and Y (_ in the sequence means missing data).
# I need to check which sequences are entirely matched by the regular expression.
# Example
sequence = 'XYX'
regexpstr = r'XXX*YY*\w*|XY+|XX*'
re.findall(regexpstr, sequence)
# In this case the sequence is NOT entirely matched, because none of the findall() results
# equals the original sequence.

['XY', 'X']

In [103]:
# First sequence
sequence = 'XXXYX_'
rfind = re.findall(regexpstr, sequence)
if not rfind:
    print("No ENTIRE MATCHING. findall() did not find any matching pattern")
else:
    if rfind[0] == sequence:
        print("ENTIRE MATCHING.The sequence {} is entirely matched by {}".format(sequence, rfind[0]))
    else:
        print("No ENTIRE MATCHING. The sequence {} is NOT entirely matched. Findall results in {}".format(sequence, rfind[0]))

ENTIRE MATCHING.The sequence XXXYX_ is entirely matched by XXXYX_


In [104]:
# Second sequence
sequence = 'XXXYYX'
rfind = re.findall(regexpstr, sequence)
if not rfind:
    print("No ENTIRE MATCHING. findall() did not find any matching pattern")
else:
    if rfind[0] == sequence:
        print("ENTIRE MATCHING.The sequence {} is entirely matched by {}".format(sequence, rfind[0]))
    else:
        print("No ENTIRE MATCHING. The sequence {} is NOT entirely matched. Findall results in {}".format(sequence, rfind[0]))

ENTIRE MATCHING.The sequence XXXYYX is entirely matched by XXXYYX


In [105]:
# Third sequence
sequence = 'XXX'
rfind = re.findall(regexpstr, sequence)
if not rfind:
    print("No ENTIRE MATCHING. findall() did not find any matching pattern")
else:
    if rfind[0] == sequence:
        print("ENTIRE MATCHING.The sequence {} is entirely matched by {}".format(sequence, rfind[0]))
    else:
        print("No ENTIRE MATCHING. The sequence {} is NOT entirely matched. Findall results in {}".format(sequence, rfind[0]))

ENTIRE MATCHING.The sequence XXX is entirely matched by XXX


In [106]:
# Fourth sequence
sequence = 'YYYXXX'
rfind = re.findall(regexpstr, sequence)
if not rfind:
    print("No ENTIRE MATCHING. findall() did not find any matching pattern")
else:
    if rfind[0] == sequence:
        print("ENTIRE MATCHING.The sequence {} is entirely matched by {}".format(sequence, rfind[0]))
    else:
        print("No ENTIRE MATCHING. The sequence {} is NOT entirely matched. Findall results in {}".format(sequence, rfind[0]))

No ENTIRE MATCHING. The sequence YYYXXX is NOT entirely matched. Findall results in XXX


In [107]:
# Fifth sequence
sequence = 'YYY'
rfind = re.findall(regexpstr, sequence)
if not rfind:
    print("No ENTIRE MATCHING. findall() did not find any matching pattern")
else:
    if rfind[0] == sequence:
        print("ENTIRE MATCHING.The sequence {} is entirely matched by {}".format(sequence, rfind[0]))
    else:
        print("No ENTIRE MATCHING. The sequence {} is NOT entirely matched. Findall results in {}".format(sequence, rfind[0]))

No ENTIRE MATCHING. findall() did not find any matching pattern
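The same check can be written more compactly with `re.fullmatch()`, which succeeds only when the expression matches the string from start to end. This is a sketch of an alternative to comparing the first findall() result with the sequence; the helper name `is_entirely_matched` and the `sequences` list are made up for illustration.

```python
import re

regexpstr = r'XXX*YY*\w*|XY+|XX*'
sequences = ['XYX', 'XXXYX_', 'XXXYYX', 'XXX', 'YYYXXX', 'YYY']

def is_entirely_matched(pattern, sequence):
    """Return True if the pattern matches the sequence from start to end."""
    return re.fullmatch(pattern, sequence) is not None

results = {seq: is_entirely_matched(regexpstr, seq) for seq in sequences}
for seq, matched in results.items():
    print(seq, "ENTIRE MATCHING" if matched else "No ENTIRE MATCHING")
```

Note that `re.fullmatch()` keeps trying the alternatives until one covers the whole string, so in general it can differ from comparing only the first findall() result; for the sequences used here the two approaches agree.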


