# Regular expressions

Regular expressions are a language for matching text patterns. The Python "re" module provides support for regular expressions.

With the methods available in the ``re`` module, we can define a pattern and search for it in a text. A pattern is a string of characters and symbols.

Adapted from https://developers.google.com/edu/python/regular-expressions

``re.search(pattern, str)``

The ``re.search()`` method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, it returns a match object; otherwise it returns ``None``.
The search proceeds through the string from start to end, stopping at the first match found.
The code ``match = re.search(pat, str)`` stores the search result in a variable named ``match``.
If the search succeeded, we can call the ``.group()`` method on the match object to retrieve the matching text.
If the search did not succeed, there is no matching text to retrieve, since the result is ``None``.
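
The success and failure cases described above can be sketched as follows. This is a minimal illustration; the sample text and both patterns are made up for the example.

```python
import re

text = "an example word:cat!!"  # hypothetical sample text

# re.search() returns a match object on success and None on failure,
# so test the result before calling .group().
match = re.search(r"word:\w\w\w", text)
if match:
    found = match.group()
    print("found", found)
else:
    found = None
    print("did not find")

# This pattern asks for three digits after "word:", which the text lacks.
no_match = re.search(r"word:\d\d\d", text)
print(no_match)
```

Calling ``.group()`` directly on a failed search raises an ``AttributeError``, which is why the ``if match:`` test comes first.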

#### Basic regex patterns:
The symbols used in regular expressions divide into ordinary characters that match themselves exactly, like ``a, 9, Z``,
and special symbols, or meta-characters, that carry special meaning and do not match themselves: ``. ^ $ * + ? { [ ] \ | ( )``

+ ``.`` (a period) -- matches any single character except newline ``\n``. With the ``re.DOTALL`` option it matches newline as well.

+ ``\w`` -- (lowercase w) matches a "word" character: a letter or digit or underscore ``[a-zA-Z0-9_]``. Note that  ``\w`` only matches a single word char, not a whole word. ``\W`` (upper case W) matches any non-word character.
+ ``\b`` -- matches the boundary between a word character and a non-word character
+ ``\s`` -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed ``[ \n\r\t\f]``. ``\S`` (upper case S) matches any non-whitespace character.
+ ``\t``, ``\n``, ``\r`` -- tab, newline, return
+ ``\d`` -- matches a decimal digit ``[0-9]``
+ ``^`` = start, ``$`` = end -- match the start or end of the string. With the ``re.MULTILINE`` option they also match at the start and end of each line.
+ ``\`` -- inhibit the "specialness" of a character. So, for example, use ``\.`` to match a period or ``\\`` to match a backslash.
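
A short sketch of the basic patterns listed above, run against a made-up sample string. The variable names are chosen only for this illustration.

```python
import re

text = "Room 42 costs $3.50 today"  # hypothetical sample text

digit = re.search(r"\d", text).group()        # first decimal digit
word = re.search(r"\w\w\w\w", text).group()   # four consecutive word characters
price = re.search(r"\d\.\d\d", text).group()  # escaped period matches a literal "."
any_char = re.search(r"R..m", text).group()   # each "." matches any one character
bounded = re.search(r"\bto\w+", text).group() # \b anchors "to" at a word boundary
at_start = re.search(r"^\w+", text).group()   # ^ anchors at the start of the string
at_end = re.search(r"\w+$", text).group()     # $ anchors at the end of the string

print(digit, word, price, any_char, bounded, at_start, at_end)
```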

#### Leftmost & Largest
The search method first finds the leftmost match for the pattern, and then extends it to the right as far as possible while still satisfying the pattern -- i.e. ``+`` and ``*`` consume as much of the string as they can (the ``+`` and ``*`` are said to be "greedy").

#### Repetitions

``+ * ?`` are meta-characters used to specify repetition in the pattern.
* ``+`` -- 1 or more occurrences of the pattern to its left, e.g. ``i+`` = one or more i's
* ``*`` -- 0 or more occurrences of the pattern to its left
* ``?`` -- match 0 or 1 occurrences of the pattern to its left
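
The "zero" cases of ``?`` and ``*`` are easy to overlook, so here is a minimal sketch of them. The sample strings are invented for the illustration.

```python
import re

# "u?" makes the letter u optional, so both spellings match.
british = re.search(r"colou?r", "the colour red").group()
american = re.search(r"colou?r", "the color red").group()

# "\d*" may match zero digits, so the pattern still succeeds without any.
zero_digits = re.search(r"ab\d*", "abc").group()
some_digits = re.search(r"ab\d*", "ab123c").group()

print(british, american, zero_digits, some_digits)
```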

In [212]:
import re
match = re.search(r"<.+>", r"<div> a paragraph </div>")
if match:
    print(match.group())

<div> a paragraph </div>


``*`` and ``+`` are called *greedy* because they try to include as much of the string as possible.
To make them *non-greedy*, add a ``?`` at the end, such as ``.*?`` or ``.+?``. Now they stop as soon as they can.

In [213]:
match = re.search(r"<.+?>", r"<div> a paragraph </div>")
if match:
    print(match.group())

<div>


__Note__: The ``r`` at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions.
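
To make the effect of the ``r`` prefix concrete, here is a small sketch comparing raw and ordinary string literals. The ``path`` value is a made-up sample.

```python
import re

raw = r"\n"    # two characters: a backslash followed by "n"
cooked = "\n"  # one character: an actual newline
print(len(raw), len(cooked))

# To match one literal backslash, the regex itself must contain two.
# A raw string lets us write that as r"\\" instead of "\\\\".
same_pattern = (r"\\" == "\\\\")

path = r"C:\temp"  # hypothetical sample string
backslash = re.search(r"\\", path).group()
print(same_pattern, backslash)
```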

In [205]:
## @+ = one or more @'s, as many as possible.
match = re.search(r'p@+', 'p@@@g') # found, match.group() == "p@@@"

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of @'s.
match = re.search(r'@+', 'p@@g@@@') # found, match.group() == "@@"

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"

## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
match.group() 

'bar'

#### Square Brackets
Square brackets are meta-characters used to indicate a set of characters to match. So ``[abc]`` matches either ``a`` or ``b`` or ``c``.
* Any single character listed inside the square brackets can be matched.
* Special characters, such as ``. () * ?`` etc. lose their special meaning and are matched literally.
* **Exception:** ``\w \d \s`` work as usual inside ``[]``.

* ``-`` (dash symbol) is used to indicate a range of characters to match, like ``[a-s0-4]``, which will match any letter from ``a`` to ``s`` and any digit from ``0`` to ``4``. To match the ``-`` literally, place it at the start or end of the set.

* ``^`` (caret) at the beginning of a square-bracket set creates a complementary set. That is, any character **but** those appearing inside ``[]`` will be matched.

In [204]:
match = re.search(r'[\w.-]+@[\w.-]+', 'email address: alice-b@google.com')
if match:
    print(match.group())  

alice-b@google.com
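
Ranges, a literal dash and complementary sets can be sketched in the same way. The sample string below is made up for the illustration.

```python
import re

text = "ab-42_x!"  # hypothetical sample text

letters = re.search(r"[a-z]+", text).group()        # range of lowercase letters
with_dash = re.search(r"[a-z0-9-]+", text).group()  # trailing "-" is a literal dash
not_letters = re.search(r"[^a-z]+", text).group()   # leading "^" complements the set
not_word = re.search(r"[^\w-]", text).group()       # \w keeps its meaning inside []
star = re.search(r"[.*]", "a*b").group()            # "." and "*" are literal inside []

print(letters, with_dash, not_letters, not_word, star)
```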


### Pattern extraction with grouping

It may happen that we are looking for some pattern in a text, but we are interested only in a specific portion of the matched text. We can use parentheses ``()`` to wrap the parts of the regular expression we want to extract. The parentheses are not matched literally; instead they group the matched text. We can create several groups in the pattern and extract any of them by number, using the ``.group()`` method of the match object. On a successful search, ``match.group(1)`` is the match text corresponding to the 1st left parenthesis, and ``match.group(2)`` is the text corresponding to the 2nd left parenthesis. The plain ``match.group()`` is the whole match text as usual.


Suppose that, given an email address, we want to extract the username and host separately. To do this, add parentheses ``( )`` around the username and host in the pattern, like this: ``r'([\w.-]+)@([\w.-]+)'``. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text.

In [208]:
str = 'purple alice3b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print(match.group())   ## 'alice3b@google.com' (the whole match)
    print(match.group(1))  ## 'alice3b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice3b@google.com
alice3b
google.com


``re.findall(pattern, str)``

While ``search()`` finds the first match for a pattern in a string, ``findall()`` finds *all* the matches and returns them as a list of strings. The parentheses ``()`` group mechanism can be combined with ``findall()``. If the pattern includes exactly one group, ``findall()`` returns a list of the strings matched by that group. If the pattern includes 2 or more groups, then instead of returning a list of strings, ``findall()`` returns a list of *tuples*, one tuple per match.

``(?: )`` marks a non-capturing group: a parenthesis group in the pattern that we do not want to extract. A group written with ``?:`` will not appear in the result.

In [211]:
## Suppose we have a text with many email addresses
str = 'There are two email addresses: alice@google.com and bob@abc.com .'

## Only the username is captured (the host group is non-capturing),
## so re.findall() returns a list of the usernames
emails = re.findall(r'([\w\.-]+)@(?:[\w\.-]+)', str)
print(emails)

['alice', 'bob']
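
For comparison, here is a sketch of the other two cases: a pattern with two capturing groups, which yields tuples, and a pattern with no groups, which yields the whole matches. It reuses the sample sentence from the cell above under the hypothetical name ``text``.

```python
import re

text = "There are two email addresses: alice@google.com and bob@abc.com ."

# Two capturing groups: each match becomes a (username, host) tuple.
pairs = re.findall(r"([\w.-]+)@([\w.-]+)", text)
print(pairs)

# No groups: each match is returned as the full matched string.
whole = re.findall(r"[\w.-]+@[\w.-]+", text)
print(whole)

# The tuples unpack naturally in a loop.
for user, host in pairs:
    print(user, "at", host)
```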


#### Options
The ``re`` functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the ``search()`` or ``findall()``, e.g. ``re.search(pattern, str, re.IGNORECASE)``.

* ``re.IGNORECASE`` -- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.
* ``re.DOTALL`` -- allow dot (.) to match newline -- normally it matches anything but newline. 
* ``re.MULTILINE`` -- Within a string made of many lines, this option allows the ``^`` and ``$`` symbols to match the start and end of each line. Normally ``^`` and ``$`` would just match the start and end of the whole string.

In [214]:
print(re.findall(r"\w+y$", "wordy and the speechy\n are tyed pretty"))
print(re.findall(r"\w+y$", "wordy and the speechy\n are tyed pretty", re.MULTILINE))

['pretty']
['speechy', 'pretty']
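
The other two options can be sketched in the same way. The sample text is made up, and the last line shows that several flags can be combined with the ``|`` operator.

```python
import re

text = "First line\nSECOND line"  # hypothetical sample text

# Without re.IGNORECASE the lowercase pattern does not match "SECOND".
case_sensitive = re.findall(r"second", text)
ignore_case = re.findall(r"second", text, re.IGNORECASE)

# Without re.DOTALL the dot stops at the newline; with it the match spans both lines.
one_line = re.search(r"First.*line", text).group()
both_lines = re.search(r"First.*line", text, re.DOTALL).group()

# Flags can be combined with "|".
line_starts = re.findall(r"^[a-z]+", text, re.IGNORECASE | re.MULTILINE)

print(case_sensitive, ignore_case, line_starts)
```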


## Scraping Baby names with REGEX

The Social Security Administration keeps a yearly record of the most popular names for babies born that year in the USA (see [social security baby names](https://www.ssa.gov/OACT/babynames/)).

The files ``baby1990.html``, ``baby1992.html``, ... contain raw HTML.

Implement the ``extract_names(folder, year)`` function, which takes the folder of HTML files and a year, and returns the data from the file as a single list -- the year string at the start of the list followed by the name-rank strings in alphabetical order.

In [7]:
import os
from collections import defaultdict

In [28]:
folder = r"C:\Users\dashb\google-python-exercises\babynames"

In [38]:
for file in os.listdir(folder):
    if re.search(r"\.html$", file):
        print(file)

baby1990.html
baby1992.html
baby1994.html
baby1996.html
baby1998.html
baby2000.html
baby2002.html
baby2004.html
baby2006.html
baby2008.html


In [124]:
def clean(line):
    """Replace HTML tags with spaces and split the remaining text into tokens."""
    return re.sub(r"<[^>]+?>", " ", line).split()

def extract_names(folder, y):
    """Extract a sorted list of popular male and female names in a specific year."""
    
    # Maps year -> {name: rank}
    result = defaultdict(dict)
    
    for file in os.listdir(folder):
        # Only process the HTML pages
        if re.search(r"\.html$", file):
            html_page = os.path.join(folder, file)
            # Take the year from the file name, e.g. baby1990.html -> 1990
            year = int(re.search(r"\d+", file).group())
        
            with open(html_page, 'r') as page:
                text = page.read()
                # Each matching table row holds a rank, a male name and a female name
                names = re.findall(r'<tr align="right">.+', text)
                for rank, male_name, female_name in map(clean, names):
                    # Keep the rank from the first row in which a name appears
                    if male_name not in result[year]:
                        result[year][male_name] = rank
                    if female_name not in result[year]:
                        result[year][female_name] = rank
    
    # Join each (name, rank) pair into a "name rank" string, sorted by name
    baby_names = [" ".join(x) for x in sorted(result[y].items(), key=lambda k: k[0])]
    return baby_names
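
To see what the row pattern and ``clean()`` do without needing the files on disk, here is a minimal sketch on a two-row sample. The sample HTML is hypothetical: it is modelled only on the ``<tr align="right">`` pattern used above, and the real files may be laid out differently.

```python
import re

def clean(line):
    """Replace HTML tags with spaces and split the remaining text into tokens."""
    return re.sub(r"<[^>]+?>", " ", line).split()

# Hypothetical sample rows; the names and ranks are invented for illustration.
sample = (
    '<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>\n'
    '<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>\n'
)

# ".+" stops at the end of each line because "." does not match a newline.
rows = re.findall(r'<tr align="right">.+', sample)
parsed = [clean(row) for row in rows]
print(parsed)
```

Each parsed row is a ``[rank, male_name, female_name]`` list, which is the shape the loop in ``extract_names()`` unpacks.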

In [226]:
year = 2006
baby_names = extract_names(folder, year)
print(f"List of popular names in {year}:\n", *baby_names[:10], sep =  "\n")

List of popular names in 2006
Aaliyah 91
Aaron 57
Abagail 895
Abbey 695
Abbie 650
Abbigail 490
Abby 205
Abdullah 888
Abel 338
Abigail 6
