#### Verbose, multiline regexp
Suppose we want to do some capturing in an address of the form<br>
`<optional #><apt num><whitespace><street name>,<city>,<2 character uppercase state code><whitespace><zip>`

In [40]:
addr = re.compile(r"""
        \s*             # possible leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        (\d+)           # capture apt number
        \s+             # at least one white space 
        (.*)?,          # capture street name, non-greedy sequence until ',', 
        \s*             # possible whitespace
        (.*)?,          # capture city name, non-greedy sequence until ',', 
        \s*             # possible white space
        ([A-Z]{2})      # capture state code
        \s*             # possible white space
        (\d{5})         # capture zip code
        \s*             # possible t railing whitespace
        $               # end of string
        """, re.VERBOSE)

In [41]:
res = addr.match(' # 25 Infinite Loop,Cupertino,CA 12345')
if res:
    for gr in res.groups():
        print(gr)

25
Infinite Loop
Cupertino
CA
12345


In [42]:
res = addr.match('#25 Infinite Loop,  Cupertino , CA 12345')
if res:
    for gr in res.groups():
        print(gr)

25
Infinite Loop
Cupertino 
CA
12345


#### Naming captured fields

In [43]:
# Can give names to the captured fields for easier access, using ?P in group
named_addr = re.compile(r"""
        \s*             # possible leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        (?P<apt>\d+)    # capture apt number
        \s+             # at least one white space 
        (?P<street>.*)?, # capture street name, non-greedy sequence until ',', 
        \s*             # possible whitespace
        (?P<city>.*)?,  # capture city name, non-greedy sequence until ',', 
        \s*             # possible white space
        (?P<state>[A-Z]{2})      # capture state code
        \s*             # possible white space
        (?P<zip>\d{5})  # capture zip code
        \s*             # possible trailing whitespace
        $               # end of string
        """, re.VERBOSE)

In [44]:
res = named_addr.match(' # 10 California Avenue,Palo Alto,CA 94304')
res.groupdict()

{'apt': '10',
 'street': 'California Avenue',
 'city': 'Palo Alto',
 'state': 'CA',
 'zip': '94304'}

#### Suppressing captures

In [45]:
# Can suppress capture using ?: inside group
named_addr = re.compile(r"""
        \s*             # possible leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        (?:\d+)         # don't capture apt num
        \s+             # at least one white space 
        (?:.*)?,        # don't capture street 
        \s*             # possible whitespace
        (?P<city>.*)?,  # capture city name, name it as 'city'
        \s*             # possible white space
        (?P<state>[A-Z]{2})      # capture state code, name it as 'state'
        \s*             # possible white space
        (?:\d{5})       # don't capture zip code
        \s*             # possible trailing whitespace
        $               # end of string
        """, re.VERBOSE)

In [46]:
res = named_addr.match(' #10 California Avenue,Palo Alto,CA 94304')
res.groupdict()

{'city': 'Palo Alto', 'state': 'CA'}

**You can, of course, get rid of the () for capture altogether**

In [47]:
named_addr = re.compile(r"""
        \s*             # possible leading white space
        \#?             # optional, use \ before # to disambiguate from comment #
        \s*             # possible whitespace
        \d+             # don't capture apt num
        \s+             # at least one white space 
        .*?,            # don't capture street 
        \s*             # possible whitespace
        (?P<city>.*)?,  # capture city name, name it as 'city'
        \s*             # possible white space
        (?P<state>[A-Z]{2})      # capture state code, name it as 'state'
        \s*             # possible white space
        \d{5}           # don't capture zip code
        \s*             # possible trailing whitespace
        $               # end of string
        """, re.VERBOSE)

In [48]:
res = named_addr.match(' #10 California Avenue,Palo Alto,CA 94304')
res.groupdict()

{'city': 'Palo Alto', 'state': 'CA'}

**But the reason you may want to keep them is you can then turn captures on and 
off as needed** 

#### Back referencing captures using name

In [49]:
# Captured string can be back referenced
backref = re.compile(r"""
            (?P<match1>air)     # capture the string 'air', name it as 'air'
            .*               # greedy
            (?P=match1)         # capture backreference to previous name 'air'
            """, re.VERBOSE)
res = backref.search('cool air or hot air today')
print(res)

<re.Match object; span=(5, 19), match='air or hot air'>


In [50]:
res = backref.search('cool air or hot air')
print(res)

<re.Match object; span=(5, 19), match='air or hot air'>


In [51]:
res = backref.search('cool air or hot')
print(res)

None


#### Using findall and finditer functions to get all matches
- findall constructs the entire list of matches before returning it
- finditer returns one match at a time, on demand, in a Match object

**findall()**

In [52]:
# Example 1-a
res = re.findall(r'\w+','These are the days of miracles and wonders!')
print(res)

['These', 'are', 'the', 'days', 'of', 'miracles', 'and', 'wonders']


In [53]:
# Example 1-b - similar but with split() 
res = re.split(r'\W+','These are the days of miracles and wonders!')
print(res)

['These', 'are', 'the', 'days', 'of', 'miracles', 'and', 'wonders', '']


In [54]:
# Example 2-a
res = re.findall(r'\w+',"I can't believe it")
print(res)

['I', 'can', 't', 'believe', 'it']


In [55]:
# Example 2-b
res = re.findall(r'\S+',"I can't believe it")
print(res)

['I', "can't", 'believe', 'it']


In [56]:
res = re.split(r'\s+', "I can't believe it")
print(res)

['I', "can't", 'believe', 'it']


**finditer()**

In [57]:
# Example 1
iterator = re.finditer(r'\w+','These are the days of miracles and wonders!')
print(iterator)
for match in iterator:
    print(match.group(),':',match.span())

<callable_iterator object at 0x7fcccb0bb010>
These : (0, 5)
are : (6, 9)
the : (10, 13)
days : (14, 18)
of : (19, 21)
miracles : (22, 30)
and : (31, 34)
wonders : (35, 42)


In [58]:
# Example 2
iterator = re.finditer(r'\S+',"I can't believe it")
for match in iterator:
    print(match.group(),':',match.span())

I : (0, 1)
can't : (2, 7)
believe : (8, 15)
it : (16, 18)


### Working with datasets with regexes

#### UCI Auto MPG dataset

In the text file auto-mpg-original.txt there are several fields in each line. 
Of these we want the mpg (first field), cylinders (second field),
the model year (third to last), and car name (last). 
We want to read lines from this file, and write these 
fields out in the following format:
<pre>
"car name",year (19xx),cylinders (int),mpg
</pre>


In [59]:
test_str='18.0   8.   307.0      130.0      3504.      12.0   70.  1.	"chevrolet chevelle malibu"'

car_reg = re.compile(r"""
                \s*                    # skip over leading whitespaces, if any
                (?P<mpg>\d{2}\.\d)     # mpg field is of the form dd.d
                \s*                    # skip white spaces
                (?P<cyl>\d)\.          # cylinders field is of the form d., only want d
                .*                     # skip all intervening stuff
                (?P<yy>\d{2})\.        # year is of form dd., only want dd
                \s*                    # skip whitespaces
                \d\.                   # origin is of the form d.
                .*                     # skip intervening stuff
                (?P<name>".*")         # car name is in double quotes, want double quotes
            """, re.VERBOSE)

In [60]:
res = car_reg.match(test_str)
res.groupdict()

{'mpg': '18.0', 'cyl': '8', 'yy': '70', 'name': '"chevrolet chevelle malibu"'}

In [61]:
res = car_reg.match(test_str)
if res:
    car_dict = res.groupdict()
    keys = ['name','yy','cyl','mpg']
    values = [car_dict[k] for k in keys]
    values[1] = '19' + values[1]
    print(','.join(values))

"chevrolet chevelle malibu",1970,8,18.0


**Notice the string join method above<br>
Iterable for join must have string values, otherwise it won't work**

**Print a few lines**

In [62]:
def my_filter(in_line):
    res = car_reg.match(in_line)
    if res:
        car_dict = res.groupdict()
        keys = ['name','yy','cyl','mpg']
        values = [car_dict[k] for k in keys]
        values[1] = '19' + values[1]
        return ','.join(values) 
    return None

In [63]:
for i,line in enumerate(open("auto-mpg-original.txt")):
    if i > 14:
        break
    out_line = my_filter(line)
    if out_line:
        print(out_line)

"chevrolet chevelle malibu",1970,8,18.0
"buick skylark 320",1970,8,15.0
"plymouth satellite",1970,8,18.0
"amc rebel sst",1970,8,16.0
"ford torino",1970,8,17.0
"ford galaxie 500",1970,8,15.0
"chevrolet impala",1970,8,14.0
"plymouth fury iii",1970,8,14.0
"pontiac catalina",1970,8,14.0
"amc ambassador dpl",1970,8,15.0


**The 5 lines immediately before that for "dodge challenger se" in the file are rejected because the first field '(NA)' doesn't meet the regular expression requirement**

### Exercise 2

Create a function `make_filter`, like the `my_filter` function above, but accepts an optional argument, `make`, with default value `None`. If provided, the function will only return values if the first word of the car name matches `make`. 

Example:
```
[In]: 
test_str='18.0   8.   307.0      130.0      3504.      12.0   70.  1.	"chevrolet chevelle malibu"'

make_filter(test_str, "chevrolet")

[Out]: 
"chevrolet chevelle malibu",1970,8,18.0


In [64]:
def make_filter(in_line, make=None):
    res = car_reg.match(in_line)
    # YOUR CODE HERE
    pass

In [66]:
make = "ford"

for i,line in enumerate(open("auto-mpg-original.txt")):
    if i > 24:
        break
    out_line = make_filter(line, make)
    if out_line:
        print(out_line)

"ford torino",1970,8,17.0
"ford galaxie 500",1970,8,15.0
"ford maverick",1970,6,21.0


**Solutions to Exercises**

In [36]:
ex1_pattern = r'"\s*(\S+)\s*,\s*(\S+)\s*"\s*,\s*(\w+)\s*'
ex1_sub = r'\2,\1,\3@rutgers.edu'

In [65]:
def make_filter(in_line, make=None):
    res = car_reg.match(in_line)
    if res:
        car_dict = res.groupdict()
        if make is not None and car_dict['name'].split()[0].find(make)<0:
            return None
        keys = ['name','yy','cyl','mpg']
        values = [car_dict[k] for k in keys]
        values[1] = '19' + values[1]
        return ','.join(values) 
    return None