In [223]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show the output of every expression in a cell (e.g. both df.head() and df.tail()), not only the last one.
# The default is 'last_expr'.

In [390]:
import re
import regex
# regex is a third-party module (pip install regex) that is backward compatible with re and adds extra features

By default, the regex module behaves the same as the re module.  To switch on its new behaviour, pass the flag regex.VERSION1 (or its short form regex.V1):<br>
regex.findall(pattern,text,regex.VERSION1) <br>
regex.findall(pattern,text,regex.V1)

## Python functions of the regex module

**match=re.search(regex,string)** <br>Searches for the first match anywhere in the string and stops when it finds it.  <br>Returns a _Match_ object holding details about the match, or None if no match was found.<br>A Match object is truthy, so it can be tested directly in an if statement, but it is not equal to 1 or True. <br><br>
**match=re.match(regex,string)**<br>
Only checks whether the BEGINNING of the string matches the regex; it returns a Match object or None in the same way

Match objects, produced from functions _match_ and _search_, have the following methods:<br>
**match.start()** returns the starting index of the match <br>
**match.end()** returns the ending index (exclusive: the last matched character is at end-1)<br>
**match.span()** returns (start,end) indices as a tuple<br>
**match.group()** returns the substring that was matched

**matches=re.findall(regex,string)** <br>Returns a list of all matches as plain strings <br>The items are not match objects, so they do NOT have the above methods (start, end, span, group)

In [391]:
text='Birth Day: 7/8/1989, Birth Day: 10/1/1943'
text2="June 24 and then another June 24"

In [392]:
regex.findall('June 24',text2)

['June 24', 'June 24']

In [317]:
regex.search('June 24',text2).span()

(0, 7)

In [248]:
regex.match('June 24',text2).group()

'June 24'

In [397]:
regex.match('and',text2)==1
regex.match('and',text2)==0
regex.match('and',text2)==None

False

False

True
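The comparisons above show that a failed match is None and that a match object is never equal to 1. A minimal sketch of the usual success test follows. It assumes the standard-library re module so that it runs without installing regex; the regex functions are called the same way.

```python
import re

text2 = "June 24 and then another June 24"

# search() scans the whole string; match() only tries position 0.
found = re.search('and', text2)
at_start = re.match('and', text2)

# A successful match is truthy, a failed one is None.
if found:
    print(found.span())    # (8, 11)
print(at_start is None)    # True
```

Testing with `is None` (or plain truthiness) avoids relying on how a match object compares to numbers.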

## Python methods of the regex module

Instead of using the above functions to search for text, you can:<br>
1. Use the compile function to create a regex object (regex pattern object) with the regular expression you'd like to use to search the text
2. Use the findall, search, or match methods of the regex object.  Search and match still produce match objects. <br>

**pattern=regex.compile(regular expression, regex.V1)**<br>
**pattern.findall(text)**

In [398]:
text="June 24 and then another June 24"

In [399]:
pattern=regex.compile('June',regex.V1)
pattern.search(text).group()

'June'
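A compiled pattern can be reused for several kinds of lookup. The sketch below assumes the standard-library re module (regex.compile works the same way) and adds finditer, a method not covered above, which returns match objects for every hit so that positions are available as well.

```python
import re

text = "June 24 and then another June 24"

# Compile once, then reuse the pattern object's methods.
pattern = re.compile('June [0-9][0-9]')

all_dates = pattern.findall(text)   # list of matched strings
first = pattern.search(text)        # match object for the first hit
print(all_dates)                    # ['June 24', 'June 24']
print(first.span())                 # (0, 7)

# finditer() yields one match object per hit, so spans can be collected.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)                        # [(0, 7), (25, 32)]
```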

## Character Classes

A **character class** or **character set** lists a set of characters inside brackets [].  It matches exactly one character of the text, and that character may be any member of the set.  The order of the characters inside the brackets makes no difference.  A range of characters can also be specified using a hyphen -.  <br>

_[tbj]oy_ matches t, b, or j followed by oy: toy, boy, joy<br>
_[A-Za-z]oy_ matches any uppercase or lowercase letter followed by 'oy' <br>
_June [0-9]_  matches 'June ' followed by a single digit from 0 through 9<br>
_June [0-9][0-9]_  matches 'June ' followed by two digits, so June 10 and above (and zero-padded dates such as June 09)

The caret character ^, when used inside a class at the very beginning, indicates "not the following":<br>
**[^a-d]** any character, except a through d

In [401]:
text2="Toys are a joy for boys and girls"
text3="Dates include: June 2, June 3, June 13, etc"

In [357]:
regex.findall('[tbj]oy',text2)

['joy', 'boy']

In [403]:
regex.findall('[A-Za-f]oy',text2)

['Toy', 'boy']

In [402]:
regex.findall('June [^0-2]',text3)

['June 3']

In [255]:
regex.findall('June [0-9][0-9]',text3)

['June 13']

## Metacharacters Inside Character Classes

Inside square brackets [], only ], \, ^, and - keep a special meaning. (With regex.V1, [ also becomes special because it opens a nested class.) <br>
To make them literal, precede them with a backslash: \^, \\, \-, \] <br>
OR, for ^, place it anywhere other than right after the opening bracket. <br>
OR, for ], place it right after the opening bracket or right after a negating caret: []x] or [^]x]. <br>
OR, for -, place it right after the opening bracket, right before the closing bracket, or right after a negating caret. <br>
All other metacharacters are taken literally inside a class and do not need to be escaped with a backslash.
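The two ways of making these characters literal can be compared directly. This is a small sketch using the standard-library re module as a stand-in for regex; the sample string is made up for illustration.

```python
import re

sample = 'a^b-c]d\\e'   # contains ^, -, ], and one backslash

# Every special character escaped with a backslash.
escaped = re.findall(r'[\^\-\]\\]', sample)

# Made literal by position: ] first, ^ not first, - last.
# The backslash has no positional form, so it is still escaped.
positional = re.findall(r'[]^\\-]', sample)

print(escaped)                 # ['^', '-', ']', '\\']
print(escaped == positional)   # True
```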

In [413]:
text='Birth Day: [7-8-1989], Birth Day: [10-1-1943]'

In [452]:
print(regex.findall('[^]-]',text)) # every character except ] and -; their positions inside the class make both of them literal

['B', 'i', 'r', 't', 'h', ' ', 'D', 'a', 'y', ':', ' ', '[', '7', '8', '1', '9', '8', '9', ',', ' ', 'B', 'i', 'r', 't', 'h', ' ', 'D', 'a', 'y', ':', ' ', '[', '1', '0', '1', '1', '9', '4', '3']


## Negated Characters Inside Character Classes

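**[^aB-]** matches any character that is not a, B, or -<br>
**[^a^B^-]** is NOT quite the same: only the first caret negates, and the later carets are literal, so this class excludes a, B, -, and also ^.  The two classes give identical results only on text that contains no ^, as in the comparison below.<br>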

In [429]:
regex.findall('[^aB-]',text)==\
regex.findall('[^a^B^-]',text)

True
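The comparison above returns True only because the text contains no caret. A short check, sketched with the standard-library re module and two made-up strings, shows where the two classes part ways:

```python
import re

plain = 'aB-c'
with_caret = 'aB-c^d'

# Without a caret in the text the two classes agree.
same_on_plain = re.findall('[^aB-]', plain) == re.findall('[^a^B^-]', plain)
print(same_on_plain)     # True

# With a caret in the text they differ: the extra carets are literal
# members of the second class, so it also excludes '^'.
first_class = re.findall('[^aB-]', with_caret)
second_class = re.findall('[^a^B^-]', with_caret)
print(first_class)       # ['c', '^', 'd']
print(second_class)      # ['c', 'd']
```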

## Set Operations for Character Classes

**[[class1]--[class2]]**  _Subtraction_: characters in class1 that are not in class2 <br>
**[[class1]||[class2]]**  _Union_: characters in either class <br>
**[[class1]~~[class2]]**  _Symmetric difference_: characters in exactly one of the two classes (the union minus the intersection) <br>
**[[class1]&&[class2]]**  _Intersection_: only characters common to both classes <br>

Only the regex module can apply set operations to classes; the re module cannot.  <br>
Because the regex module behaves like the re module by default, set operations must be switched on with the flag _regex.V1_ or _regex.VERSION1_.

In [430]:
text='abcdef'

In [431]:
regex.findall('[[a-z]--[aeiou]]',text,regex.V1)

['b', 'c', 'd', 'f']

In [287]:
regex.findall('[[a]||[b]]',text,regex.V1)

['a', 'b']

In [286]:
regex.findall('[[abc]~~[ab]]',text,regex.V1)

['c']

In [290]:
regex.findall('[[abc]&&[ab]]',text,regex.V1)

['a', 'b']
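When only the standard-library re module is available, some of these results can be approximated with a lookahead assertion, a construct not covered in this section. The following is a rough sketch of that workaround under that assumption, not a replacement for the regex set operators.

```python
import re

text = 'abcdef'

# Subtraction [[a-z]--[aeiou]]: a lowercase letter that is not a vowel.
subtraction = re.findall('(?![aeiou])[a-z]', text)

# Intersection [[abc]&&[ab]]: a character that belongs to both classes.
intersection = re.findall('(?=[abc])[ab]', text)

# Union [[a]||[b]]: simply list both members in one class.
union = re.findall('[ab]', text)

print(subtraction)    # ['b', 'c', 'd', 'f']
print(intersection)   # ['a', 'b']
print(union)          # ['a', 'b']
```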

## Shorthand Character Classes

Some common classes have designated shorthand symbols, called **shorthand character classes**. <br>
[\d] ->   [0-9] any digit<br>
[\w] ->   [A-Za-z0-9\_] any letter, digit, or underscore (a "word" character)<br>
[\s] ->   [ \t\r\n\f\v] any whitespace character (tab, space, newline, etc)

[\D] ->   [^0-9] any non-digit<br>
[\W] ->   [^A-Za-z0-9\_] any non-word character<br>
[\S] ->   [^ \t\r\n\f\v] any non-whitespace character <br>

Shorthand character classes can be combined inside brackets, like literal characters, to form a class:<br>
[\d\s]  matches a single character that is either a whitespace or a digit character <br>

Shorthand character classes can also be used outside brackets, where each one matches its own character in sequence:  <br>
\d\s    matches a single digit followed by a whitespace character <br>

**Negated shorthand classes** <br>
[^\d\s] matches any character that is neither a digit nor a whitespace (it can't be either), so (not d & not s)<br>
[\D\S] matches any character that is either not a digit or not a whitespace, so (not d or not s); since no character is both a digit and a whitespace, this matches every character

In [439]:
text='Birth Day: 7/8/1989, Birth Day: 10/1/1943'

In [440]:
# [\W\S] matches a non-word character (like /) OR a non-whitespace character.
# Word characters are non-whitespace, and whitespace characters are non-word,
# so every character in the text ends up matching.
print(regex.findall(r'[\W\S]',text)) 

['B', 'i', 'r', 't', 'h', ' ', 'D', 'a', 'y', ':', ' ', '7', '/', '8', '/', '1', '9', '8', '9', ',', ' ', 'B', 'i', 'r', 't', 'h', ' ', 'D', 'a', 'y', ':', ' ', '1', '0', '/', '1', '/', '1', '9', '4', '3']


In [408]:
# return two-character substrings: a capital letter followed by a word character
print(regex.findall(r'[A-Z][\w]','Ab ab Bb bb')) 

['Ab', 'Bb']


In [419]:
print(regex.findall(r'[^\s\d]',text)) # every character that is neither whitespace nor a digit

['B', 'i', 'r', 't', 'h', 'D', 'a', 'y', ':', '/', '/', ',', 'B', 'i', 'r', 't', 'h', 'D', 'a', 'y', ':', '/', '/']
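The difference between negating the class and using negated shorthands is easiest to see on a short string. This sketch uses the standard-library re module and a made-up sample; the variable names are only illustrative.

```python
import re

text = 'Day: 7/8'

# [^\d\s]: not a digit AND not whitespace.
neither = re.findall(r'[^\d\s]', text)

# [\D\S]: not a digit OR not whitespace, which is true of every character.
either = re.findall(r'[\D\S]', text)

print(neither)                    # ['D', 'a', 'y', ':', '/']
print(len(either) == len(text))   # True
```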


## The dot metacharacter

**.** wildcard metacharacter that matches any character except the newline character <br>
**\\.** escapes the dot metacharacter so that its meaning is taken literally

In [432]:
regex.findall(r'...\.','cat. dog. 932. abc1') # any three characters followed by a literal dot

['cat.', 'dog.', '932.']
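The newline exception can be checked directly. The sketch below uses the standard-library re module; the DOTALL flag, which is not covered above, is the switch that lets the dot match a newline as well.

```python
import re

two_lines = 'a\nb'

# By default the dot stops at a newline, so 'a.b' finds nothing here.
default = re.findall('a.b', two_lines)

# With DOTALL the dot also matches '\n'.
dotall = re.findall('a.b', two_lines, re.DOTALL)

print(default)   # []
print(dotall)    # ['a\nb']
```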

## Anchor Metacharacters -  ^, $, \A,\Z

Anchors are metacharacters that do not match characters, but when placed next to other characters, cause a match to occur for those characters ONLY at certain locations in the text, like the start or end of the text.

The **caret ^** metacharacter, when placed in front of the characters we'd like to match, allows a match only at the beginning of the text (or also just after each newline character, if regex.MULTILINE is specified). <br>

The **$** metacharacter, when placed after the characters we'd like to match, allows a match only at the end of the text or just before a newline that ends the text (or also just before every newline character, if regex.MULTILINE is specified).<br>
 
**\A** is the same as caret ^, except it ONLY matches at the very beginning of the text, even if regex.MULTILINE is specified. <br>

**\Z** is like $, except it ONLY matches at the very end of the text, even if regex.MULTILINE is specified. <br> In some other regex flavors \Z also matches before a final newline and a separate \z means the very end; in Python, \Z always means the very end.

What if you want to match characters at the start or end of each line, where lines are separated by newline characters (\n)?<br>
Pass the flag **regex.MULTILINE**. <br>
The caret ^ anchor then matches at the start of the string, AS WELL AS after each line break (meaning, after each \n). <br>
The $ anchor then matches at the end of the string, AS WELL AS before each line break (meaning, before each \n). <br>

In [519]:
text='Birth Day: 7/8/1989, Birth Day: 10/1/1943'
regex.findall('^Birth',text)
regex.findall('1943$',text)

['Birth']

['1943']

In [509]:
text='Birth Day: 7/8/1989\nBirth Day: 10/1/1943\n\
Birth Day: 9/8/1955\nBirth Day: 12/7/1971'
print(text)

Birth Day: 7/8/1989
Birth Day: 10/1/1943
Birth Day: 9/8/1955
Birth Day: 12/7/1971


In [504]:
regex.findall('^Birth',text,regex.MULTILINE)
regex.findall('....$',text,regex.MULTILINE)

['Birth', 'Birth', 'Birth', 'Birth']

['1989', '1943', '1955', '1971']

In [508]:
regex.findall(r'\ABirth',text,regex.MULTILINE) 
regex.findall(r'....\Z',text,regex.MULTILINE)
# notice below that regex.MULTILINE makes NO difference to \A and \Z

['Birth']

['1971']

In [517]:
text="4 is 4\n4 is more.\n4 can't not be 4"
print(text)
regex.findall('^4$',text,regex.MULTILINE) # ^4$ needs a line consisting solely of '4', so nothing matches here

4 is 4
4 is more.
4 can't not be 4


[]
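Two details of the anchors are worth checking: how $ and \Z treat a trailing newline, and what ^4$ needs in order to match. This is a sketch with the standard-library re module and made-up strings; the regex module accepts the same patterns and flag.

```python
import re

ends_with_newline = 'is 4\n'

# $ also matches just before a newline that ends the text.
dollar = re.findall(r'4$', ends_with_newline)

# \Z matches only at the very end, which here is after the newline.
strict = re.findall(r'4\Z', ends_with_newline)

print(dollar)   # ['4']
print(strict)   # []

# ^4$ with MULTILINE matches only a line that is exactly '4'.
lines = '4 is 4\n4\n4 can'
only_four = re.findall(r'^4$', lines, re.MULTILINE)
print(only_four)   # ['4']
```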