# Regular Expressions Part 2
### My go-to reference is the [Regular Expression How To](http://docs.python.org/3/howto/regex.html) webpage.

## What this notebook covers

1. Groups
2. Match Objects
3. Named Groups
4. Modifying Strings with Regular Expressions

### Groups
* Individual parts of a regular expression can be identified using parentheses **()**. 
* These are then known as *GROUPS*. 
* Groups can be made optional by putting a question mark **(?)** after the group. 
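
The bullets above can be tried on a small input before moving to the longer test string. This is a minimal sketch, assuming Julia 1.x; the names `sample`, `pattern`, and `optional` are illustrative and do not appear elsewhere in this notebook.

```julia
# Illustrative input, not the notebook's test string
sample = "python perl ruby"

# The outer group captures the whole word; the inner group captures what follows the "p"
pattern = r"(p(ython|erl))"

m = match(pattern, sample)
println(m.match)   # text matched by the whole pattern
println(m[1])      # outer group
println(m[2])      # inner group

# Optional group: the "www." prefix may or may not be present
optional = r"(www\.)?(python)"
with_prefix = match(optional, "www.python")
without_prefix = match(optional, "python")

println(with_prefix[1])
# An optional group that did not take part in the match is reported as `nothing`
println(without_prefix[1] === nothing)
```

Groups are numbered by the position of their opening parenthesis, so the outer group is 1 and the inner group is 2.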


# Examples

In [1]:
testString = """
Br. Chapman died yesterday. Brian Chapman, much beloved, Brian E. Chapman Brian Earl Chapman 
Wendy Webber Chapman Clare 
1234 4321.1234 Clare A Chapman python python.org 
http://python.org www.python.org jython zython Brad Bob cpython brian http://www.python.org perl Perl PERL
https://www.utah.edu

https://www.python.org"""

"Br. Chapman died yesterday. Brian Chapman, much beloved, Brian E. Chapman Brian Earl Chapman \nWendy Webber Chapman Clare \n1234 4321.1234 Clare A Chapman python python.org \nhttp://python.org www.python.org jython zython Brad Bob cpython brian http://www.python.org perl Perl PERL\nhttps://www.utah.edu\n\nhttps://www.python.org"

### Here is a regular expression to match `python` or `perl`
```Python
r"""(p(ython|erl))"""
```
#### This uses a group `ython|erl` within a larger group

In [2]:
e4 = r"""(p(ython|erl))"""
println(matchall(e4, testString))



SubString{String}["python","python","python","python","python","python","perl","python"]
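
`matchall` was removed in Julia 1.0. On newer versions, one way to build the same list of matched strings is a comprehension over `eachmatch`. The sketch below assumes Julia 1.x; `sample` and `found` are illustrative names.

```julia
# Illustrative input, not the notebook's test string
sample = "python perl Perl cpython"
e4 = r"(p(ython|erl))"

# Collect the matched text of every non-overlapping match
found = [m.match for m in eachmatch(e4, sample)]
println(found)
```

`Perl` is skipped because the pattern requires a lowercase `p`.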


### Optional Groups
#### Here is a regular expression to match parts of the python web address
#### All groups are optional except for `python`

In [21]:
r1 = r"""(http[s]*://)?(www\.)?(python)(\.org)?"""
for m in matchall(r1, testString)
    println(m)
    println(typeof(m))
    println("**********")
end

python
SubString{String}
**********
python.org
SubString{String}
**********
http://python.org
SubString{String}
**********
www.python.org
SubString{String}
**********
python
SubString{String}
**********
http://www.python.org
SubString{String}
**********
https://www.python.org
SubString{String}
**********


## Match Objects
### `matchall` is useful: it returns the strings we matched. But...
### There is a more powerful way of working with what we've matched
* `eachmatch`
* `match`
* Both produce `RegexMatch` objects, whose fields describe the matched string: `match` (the matched text), `captures` (the groups), `offset` (where the match starts), and `offsets` (where each group starts).
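
Before running this on the full test string, it may help to see the fields on a one-line input. This is a minimal sketch, assuming Julia 1.x; `text`, `url`, and `no_match` are illustrative names.

```julia
# Illustrative input, not the notebook's test string
text = "see http://python.org for details"
url = r"(http[s]*://)?(python)(\.org)?"

m = match(url, text)
if m !== nothing
    println(m.match)     # whole matched substring
    println(m.captures)  # one entry per group
    println(m.offset)    # 1-based index where the match starts
    println(m.offsets)   # 1-based start index of each group
end

# `match` returns `nothing` when the pattern is not found anywhere in the string
no_match = match(url, "perl only")
println(no_match === nothing)
```

Checking for `nothing` before touching the fields avoids an error when the pattern does not occur.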

In [22]:
m = match(r1, testString)
println(typeof(m))

RegexMatch


In [28]:
for m in eachmatch(r1, testString)
    println(m)
    println(m.match)
    println(m[1], " ", m[2], " ", m[3], " ", m[4])
    println("***************")
end

RegexMatch("python", 1=nothing, 2=nothing, 3="python", 4=nothing)
python
nothing nothing python nothing
***************
RegexMatch("python.org", 1=nothing, 2=nothing, 3="python", 4=".org")
python.org
nothing nothing python .org
***************
RegexMatch("http://python.org", 1="http://", 2=nothing, 3="python", 4=".org")
http://python.org
http:// nothing python .org
***************
RegexMatch("www.python.org", 1=nothing, 2="www.", 3="python", 4=".org")
www.python.org
nothing www. python .org
***************
RegexMatch("python", 1=nothing, 2=nothing, 3="python", 4=nothing)
python
nothing nothing python nothing
***************
RegexMatch("http://www.python.org", 1="http://", 2="www.", 3="python", 4=".org")
http://www.python.org
http:// www. python .org
***************
RegexMatch("https://www.python.org", 1="https://", 2="www.", 3="python", 4=".org")
https://www.python.org
https:// www. python .org
***************


In [33]:
println(m.offset, " ", m.offsets)

302 [302,310,314,320]


In [34]:
range(1,4)

1:4

In [38]:
for m in eachmatch(r1, testString)
    println(m)
    for i in 1:4
        if m[i] !== nothing
            println(m[i], " ", m.offsets[i])
        end
    end
    println("***************")
end

RegexMatch("python", 1=nothing, 2=nothing, 3="python", 4=nothing)
python 154
***************
RegexMatch("python.org", 1=nothing, 2=nothing, 3="python", 4=".org")
python 161
.org 167
***************
RegexMatch("http://python.org", 1="http://", 2=nothing, 3="python", 4=".org")
http:// 173
python 180
.org 186
***************
RegexMatch("www.python.org", 1=nothing, 2="www.", 3="python", 4=".org")
www. 191
python 195
.org 201
***************
RegexMatch("python", 1=nothing, 2=nothing, 3="python", 4=nothing)
python 230
***************
RegexMatch("http://www.python.org", 1="http://", 2="www.", 3="python", 4=".org")
http:// 243
www. 250
python 254
.org 260
***************
RegexMatch("https://www.python.org", 1="https://", 2="www.", 3="python", 4=".org")
https:// 302
www. 310
python 314
.org 320
***************
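
The offsets printed above are 1-based start positions. A start and end pair (a span) for any group can be derived from the offset and the length of the captured text. The sketch below assumes Julia 1.x; `group_span` is a hypothetical helper written for this illustration, not part of Julia's standard library.

```julia
# Illustrative input, not the notebook's test string
text = "visit www.python.org today"
m = match(r"(www\.)?(python)(\.org)?", text)

# Hypothetical helper: (start, stop) indices of capture group i,
# or `nothing` if that group did not take part in the match
function group_span(m::RegexMatch, text::AbstractString, i::Integer)
    m[i] === nothing && return nothing
    start = m.offsets[i]
    # prevind steps back to the index of the last character of the group
    stop = prevind(text, start + ncodeunits(m[i]))
    return (start, stop)
end

span = group_span(m, text, 2)
println(span)
# Slicing the original string with the span recovers the group text
println(text[span[1]:span[2]])
```

Using `prevind` rather than plain subtraction keeps the end index valid even when the captured text ends in a multi-byte character.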


## Named Groups
* As we've defined groups so far, the individual groups can be accessed through indexing
* Groups can be named as follows:
```python   
    (?P<frame>[0-9]+)
```    
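
As a small illustration of the `frame` pattern above, the sketch below pulls a frame number out of a file name. It assumes Julia 1.x; `filename` and `frame_number` are illustrative names and the file name itself is made up.

```julia
# Made-up file name used only for this illustration
filename = "scan_frame0042.png"
m = match(r"frame(?P<frame>[0-9]+)", filename)

println(m[:frame])    # access by symbol
println(m["frame"])   # access by string name
println(m[1])         # the same group, by position

# The captured text is a string; convert it when a number is needed
frame_number = parse(Int, m[:frame])
println(frame_number + 1)
```

Naming the group keeps the lookup readable even if more groups are later added in front of it.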

### Named groups can be accessed either through index or name

In [40]:
r1 = r"""(?P<protocol>http[s]*://)?(?P<prefix>www\.)?(?P<name>python)(?P<suffix>\.org)?"""
for d in eachmatch(r1, testString)
    println("d[1]=$(d[1]); d[2]=$(d[2]); d[3]=$(d[3]); d[4]=$(d[4])")
end

d[1]=nothing; d[2]=nothing; d[3]=python; d[4]=nothing
d[1]=nothing; d[2]=nothing; d[3]=python; d[4]=.org
d[1]=http://; d[2]=nothing; d[3]=python; d[4]=.org
d[1]=nothing; d[2]=www.; d[3]=python; d[4]=.org
d[1]=nothing; d[2]=nothing; d[3]=python; d[4]=nothing
d[1]=http://; d[2]=www.; d[3]=python; d[4]=.org
d[1]=https://; d[2]=www.; d[3]=python; d[4]=.org


In [42]:
for d in eachmatch(r1, testString)
    println("d[:protocol]=$(d[:protocol]); d[:prefix]=$(d[:prefix]); d[:name]=$(d[:name]); d[:suffix]=$(d[:suffix])")
end    


d[:protocol]=nothing; d[:prefix]=nothing; d[:name]=python; d[:suffix]=nothing
d[:protocol]=nothing; d[:prefix]=nothing; d[:name]=python; d[:suffix]=.org
d[:protocol]=http://; d[:prefix]=nothing; d[:name]=python; d[:suffix]=.org
d[:protocol]=nothing; d[:prefix]=www.; d[:name]=python; d[:suffix]=.org
d[:protocol]=nothing; d[:prefix]=nothing; d[:name]=python; d[:suffix]=nothing
d[:protocol]=http://; d[:prefix]=www.; d[:name]=python; d[:suffix]=.org
d[:protocol]=https://; d[:prefix]=www.; d[:name]=python; d[:suffix]=.org


In [46]:
name = r"""((?P<fname>[A-Z][a-z]+)\s((?P<mname>[A-Z][A-Za-z\.]*)\s)?(?P<lname>[A-Z][A-Za-z]+))"""

for n in eachmatch(name, testString)
    println("$(n.match) <$(n.offset), $(n.offset+length(n.match))>")
    println("$(n[:lname]), $(n[:fname])")
    # println("*"^42)  # optional separator line between matches
end

Brian Chapman <29, 42>
Chapman, Brian
Brian E. Chapman <58, 74>
Chapman, Brian
Brian Earl Chapman <75, 93>
Chapman, Brian
Wendy Webber Chapman <95, 115>
Chapman, Wendy
Clare A Chapman <138, 153>
Chapman, Clare
Brad Bob <220, 228>
Bob, Brad
Perl PERL <270, 279>
PERL, Perl


## Modifying Strings with Regular Expressions
* Regular expressions can also be used to modify text.
* The example below finds runs of whitespace, including tabs and newlines, and replaces each run with a single space.

In [49]:
test = """Brian E.     Chapman\t\n has many bikes, including    a Big  Dummy, which  is probably the\t\t\nbike with   the    best name."""

"""Clean up whitespace in text by replacing each run of whitespace characters with a single space."""
function cleanText(txt)
    # On Julia 1.0 and later, write this call as replace(txt, r"\s+" => " ")
    txt2 = replace(txt, r"\s+", " ")
    return txt2
end
println(test)
println("*"^42)
println(cleanText(test))

Brian E.     Chapman	
 has many bikes, including    a Big  Dummy, which  is probably the		
bike with   the    best name.
******************************************
Brian E. Chapman has many bikes, including a Big Dummy, which is probably the bike with the best name.
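
Groups and replacement can be combined: the replacement text may refer back to whatever the groups captured. The sketch below assumes Julia 1.x, where `replace` takes a `pattern => replacement` pair and `s"..."` marks a substitution string; `people` and `swapped` are illustrative names.

```julia
# Illustrative input, not the notebook's test string
people = "Brian Chapman and Wendy Chapman"

# \1 and \2 in the substitution string refer to the first and second groups
swapped = replace(people, r"([A-Z][a-z]+)\s([A-Z][a-z]+)" => s"\2, \1")
println(swapped)
```

The lowercase word `and` does not match either group, so it passes through unchanged.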


