# Regular Expressions Part 2
### My go-to reference is the [Regular Expression HOWTO](http://docs.python.org/3/howto/regex.html) webpage.

## What this notebook covers

1. Groups
2. Match Objects
3. Named Groups
4. Modifying Strings with Regular Expressions

### Groups
* Individual parts of a regular expression can be identified by enclosing them in parentheses **()**.
* These are then known as *GROUPS*.
* A group can be made optional by putting a question mark **?** immediately after its closing parenthesis, as in `(www\.)?`.
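
As a minimal sketch of an optional group (the pattern and the two input strings here are made up for illustration and are not part of the notebook's test data), an optional group that takes no part in a match comes back as `nothing`:

```julia
# Hypothetical pattern: an optional title group followed by a required name group.
title_pattern = r"(Dr\. )?(Chapman)"

with_title = match(title_pattern, "Dr. Chapman")
without_title = match(title_pattern, "Chapman")

println(with_title[1])      # the optional group matched the title
println(without_title[1])   # the optional group did not take part, so this is `nothing`
println(without_title[2])   # the required group still matched
```

This is why the loops later in the notebook check a group against `nothing` before using it.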


# Examples

In [None]:
testString = """
Br. Chapman died yesterday. Brian Chapman, much beloved, Brian E. Chapman Brian Earl Chapman 
Wendy Webber Chapman Clare 
1234 4321.1234 Clare A Chapman python python.org 
http://python.org www.python.org jython zython Brad Bob cpython brian http://www.python.org perl Perl PERL
https://www.utah.edu

https://www.python.org"""

### Here is a regular expression to match `python` or `perl`
```Python
r"""(p(ython|erl))"""
```
#### This uses a group `ython|erl` within a larger group

In [None]:
e4 = r"""(p(ython|erl))"""
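# `matchall` was removed in Julia 1.0; collect the matched strings from `eachmatch` instead
println([m.match for m in eachmatch(e4, testString)])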



### Optional Groups
#### Here is a regular expression to match parts of the python web address
#### All groups are optional except for `python`

In [None]:
r1 = r"""(http[s]*://)?(www\.)?(python)(\.org)?"""
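# `matchall` was removed in Julia 1.0; iterate over the matched strings from `eachmatch` instead
for m in (x.match for x in eachmatch(r1, testString))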
    println(m)
    println(typeof(m))
    println("**********")
end

## Matched Objects
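### `matchall` returned only the strings we matched (and was removed in Julia 1.0). But...
### There is a more powerful way of working with what we've matched
* `eachmatch` returns an iterator over every match in the string.
* `match` returns the first match, or `nothing` if there is none.
* Each match is a `RegexMatch` object whose fields describe the matched string: `match` (the matched text), `captures` (the groups), `offset` (where the match starts), and `offsets` (where each group starts).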
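
The following is a small sketch of those fields on a made-up input string (the string and the variable names `demo` and `span` are illustrative assumptions). As far as I know `RegexMatch` has no span field of its own, so the span is built here from `offset`:

```julia
# Hypothetical input, used only to show the fields of a RegexMatch.
demo = match(r"(www\.)?(python)", "see www.python.org")

println(demo.match)      # the whole matched substring
println(demo.captures)   # one entry per group
println(demo.offset)     # 1-based index where the match starts
println(demo.offsets)    # start index of each group

# Build the span from the offset. Offsets are byte indices, so `ncodeunits`
# (rather than `length`) keeps this correct for non-ASCII text too.
span = demo.offset:(demo.offset + ncodeunits(demo.match) - 1)
println(span)
```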

In [None]:
m = match(r1, testString)
println(typeof(m))

In [None]:
for m in eachmatch(r1, testString)
    println(m)
    println(m.match)
    println(m[1], " ", m[2], " ", m[3], " ", m[4])
    println("***************")
end

In [None]:
println(m.offset, " ", m.offsets)

In [None]:
1:4

In [None]:
for m in eachmatch(r1, testString)
    println(m)
    for i in 1:4
        if m[i] !== nothing
            println(m[i], " ", m.offsets[i])
        end
    end
    println("***************")
end

## Named Groups
* As we've defined groups so far, the individual groups can be accessed through indexing
* Groups can be named as follows:
```python   
    (?P<frame>[0-9]+)
```    

### Named groups can be accessed either through index or name
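
A minimal sketch using the `frame` group from above (the input string and the variable names are made up for illustration). The same capture can be reached by symbol, by string, or by position:

```julia
# The `frame` group from the text, applied to a hypothetical input string.
frame_pattern = r"(?P<frame>[0-9]+)"
fm = match(frame_pattern, "frame 0042 of the scan")

println(fm[:frame])    # access by symbol
println(fm["frame"])   # access by string
println(fm[1])         # access by position; it is the first group in the pattern
```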

In [None]:
r1 = r"""(?P<protocol>http[s]*://)?(?P<prefix>www\.)?(?P<name>python)(?P<suffix>\.org)?"""
for d in eachmatch(r1, testString)
    println("d[1]=$(d[1]); d[2]=$(d[2]); d[3]=$(d[3]); d[4]=$(d[4])")
end

In [None]:
for d in eachmatch(r1, testString)
    println("d[:protocol]=$(d[:protocol]); d[:prefix]=$(d[:prefix]); d[:name]=$(d[:name]); d[:suffix]=$(d[:suffix])")
end    


In [None]:
name = r"""((?P<fname>[A-Z][a-z]+)\s((?P<mname>[A-Z][A-Za-z\.]*)\s)?(?P<lname>[A-Z][A-Za-z]+))"""

for n in eachmatch(name, testString)
    println("$(n.match) <$(n.offset), $(n.offset+length(n.match))>")
    println("$(n[:lname]), $(n[:fname])")
    # println("*"^42)
end

## Modifying Strings with Regular Expressions
* Regular expressions can also be used to modify text
* Here is an example where we identify runs of whitespace, including tabs and newlines, and replace each run with a single space.

In [None]:
test = """Brian E.     Chapman\t\n has many bikes, including    a Big  Dummy, which  is probably the\t\t\nbike with   the    best name."""

"""Clean up whitespace in text by replacing each run of whitespace with a single space."""
function cleanText(txt)
    # `replace` takes a `pattern => replacement` pair in Julia 1.0 and later
    txt2 = replace(txt, r"\s+" => " ")
    return txt2
end
println(test)
println("*"^42)
println(cleanText(test))
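
Groups can also be reused in the replacement text. The sketch below assumes Julia 1.0 or later and uses a made-up input string. It relies on Julia's substitution strings (`s"..."`), in which `\1` refers to the first group and `\g<name>` to a named group:

```julia
# Reorder a hypothetical "first last" name into "last, first" using numbered groups.
reordered = replace("Brian Chapman", r"(\w+) (\w+)" => s"\2, \1")
println(reordered)

# The same reordering, referring to the groups by name instead of by position.
named = replace("Brian Chapman",
                r"(?P<fname>\w+) (?P<lname>\w+)" => s"\g<lname>, \g<fname>")
println(named)
```

Named references tend to stay readable when a pattern grows, since adding a group does not renumber the ones already in use.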