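## Notes on regex syntax:
- regex101.com is useful for building and testing patterns
- this is a character class: []. It matches any one of the characters listed inside it
- "." matches any single character except a newline, similar to "_" in a SQL LIKE pattern
- " ^ " means different things depending on where it appears:
    - as the first character inside of a character class, " ^ " means not in, similar to python "not in". example r"[^a]" matches any single character other than "a", so it still matches "apple" (on the "p") but would NOT match "aaa".
    - outside of a character class, " ^ " means begins with. example r"^a" would return "apple" if apple was in the list.
- use regex101's comment tool to generate comments in your code
- for a reference regex for e-mail addresses, see the HTML spec: https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address

- match objects: re.search returns a match object when the pattern is found and None when it is not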
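
A short sketch of the notes above, using only the standard `re` module. The word list and variable names are made up for illustration.

```python
import re

words = ['apple', 'banana', 'aaa', 'kiwi']  # hypothetical sample list

# "." matches any single character except a newline
print(re.findall(r'.', 'a b\n'))  # ['a', ' ', 'b']

# [^a] matches one character that is not "a", so "apple" still matches (on "p")
has_non_a = [w for w in words if re.search(r'[^a]', w)]
print(has_non_a)  # ['apple', 'banana', 'kiwi']

# ^a outside a character class anchors the match to the start of the string
starts_with_a = [w for w in words if re.search(r'^a', w)]
print(starts_with_a)  # ['apple', 'aaa']

# re.search returns a match object (truthy) or None (falsy)
match = re.search(r'pp', 'apple')
print(match.group(), match.span())  # pp (1, 3)
print(re.search(r'z', 'apple'))  # None
```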

In [1]:
import re

In [3]:
re.findall(r'bc', 'abcd')

['bc']

In [4]:
def show_all_matches(regexes, subject, re_length=6):
    print('Sentence:')
    print()
    print('      {}'.format(subject))
    print()
    print('regexp{} | matches'.format(' ' * (re_length - 6)))
    print(' ------{} | ------'.format(' ' * (re_length - 6)))
    for regexp in regexes:
        fmt = ' {:<%d} | {!r}' % re_length
        matches = re.findall(regexp, subject)
        if len(matches) > 8:
            matches = matches[:8] + ['...']
        print(fmt.format(regexp, matches))

In [5]:
sentence = 'Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.'

show_all_matches([
    r'a',
    r'm',
    r'M',
    r'Mary',
    r'little',
    r'1',
    r'10',
    r'22'
], sentence)


Sentence:

      Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

regexp | matches
 ------ | ------
 a      | ['a', 'a', 'a', 'a', 'a']
 m      | ['m', 'm']
 M      | ['M']
 Mary   | ['Mary']
 little | ['little', 'little']
 1      | ['1', '1', '1']
 10     | ['10']
 22     | ['22']
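
Two things stand out in the output above: `m` and `M` are different patterns, and `1` also matches the digits inside `10` and `12`. A minimal sketch of how either behavior could be changed, assuming the same `sentence`. `re.IGNORECASE` and the word-boundary token `\b` are standard `re` features that these notes have not introduced yet.

```python
import re

sentence = 'Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.'

# matching is case-sensitive by default; re.IGNORECASE relaxes that
any_case_m = re.findall(r'm', sentence, re.IGNORECASE)
print(any_case_m)  # ['M', 'm', 'm']

# \b marks a word boundary, so this only matches a standalone 1
standalone_one = re.findall(r'\b1\b', sentence)
print(standalone_one)  # ['1']
```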


| metacharacter | matches |
| --- | --- |
| . | any single character except a newline |
| \w | any letter, digit, or underscore |
| \W | anything that's not a letter, digit, or underscore |
| \d | any digit |
| \D | anything that's not a digit |
| \s | any whitespace character |

In [6]:
res = [
    r'\w',
    r'\d',
    r'\s',
    r'.', # matches every character
    r'\.', # a literal period
]
show_all_matches(res, sentence)


Sentence:

      Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

regexp | matches
 ------ | ------
 \w     | ['M', 'a', 'r', 'y', 'h', 'a', 'd', 'a', '...']
 \d     | ['1', '1', '0', '1', '2', '2', '2']
 \s     | [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '...']
 .      | ['M', 'a', 'r', 'y', ' ', 'h', 'a', 'd', '...']
 \.     | ['.', '.', '.']
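
The uppercase metacharacters are the complements of the lowercase ones. A small sketch on a shorter, made-up string so that the full result fits on one line:

```python
import re

subject = 'Not 10, not 12.'  # shortened sample, not the full sentence

# \W keeps only the characters that \w rejects: spaces and punctuation here
non_word = re.findall(r'\W', subject)
print(non_word)  # [' ', ',', ' ', ' ', '.']

# \D keeps everything except the digits
non_digit = ''.join(re.findall(r'\D', subject))
print(non_digit)  # Not , not .

# \w also accepts the underscore, not just letters and digits
word_chars = re.findall(r'\w', 'a_1-')
print(word_chars)  # ['a', '_', '1']
```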


## Exercises

### 1. Write a function named is_vowel. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.

In [39]:
def is_vowel(string):
    # ^ and $ anchor the pattern, so the whole string must be exactly one vowel
    # re.IGNORECASE accepts uppercase vowels as well
    return bool(re.search(r"^[aeiou]$", string, re.IGNORECASE))

In [40]:
is_vowel('a'), is_vowel('banana')

(True, False)

### 2. Write a function named is_valid_username that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character. It should also be no longer than 32 characters. The function should return either True or False depending on whether the passed string is a valid username.

In [59]:
def is_valid_username(string):
    # ^[a-z]           first character must be a lowercase letter
    # [a-z_0-9]{0,31}  up to 31 more lowercase letters, digits, or underscores (32 characters max in total)
    # $                nothing else may follow
    # \w = [0-9a-zA-Z_] is not used here because it would also allow uppercase letters
    return bool(re.search(r"^[a-z][a-z_0-9]{0,31}$", string))

In [62]:
a = 'a' * 35
kyle = 'sensei_36'

In [60]:
is_valid_username(a)

False

In [56]:
is_valid_username('fred_lindsey_92')

True

In [63]:
is_valid_username(kyle)

True
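
One subtlety with `$`: in Python it also matches just before a trailing newline, so the pattern above accepts `'fred\n'`. A hedged alternative is `re.fullmatch`, which requires the whole string to match. The function name `is_valid_username_strict` is invented here for comparison.

```python
import re

def is_valid_username_strict(string):
    # fullmatch anchors both ends, so ^ and $ are not needed
    return bool(re.fullmatch(r'[a-z][a-z_0-9]{0,31}', string))

# search with $ tolerates one trailing newline
print(bool(re.search(r'^[a-z][a-z_0-9]{0,31}$', 'fred\n')))  # True

print(is_valid_username_strict('fred\n'))            # False
print(is_valid_username_strict('fred_lindsey_92'))   # True
print(is_valid_username_strict('a' * 32))            # True
print(is_valid_username_strict('a' * 33))            # False
print(is_valid_username_strict('Fred'))              # False
```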

### 3. Write a regular expression to capture phone numbers. It should match all of the following:

- phone_numbers = '(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309'

In [159]:

def capture_phone_numbers(string):
    # ^(\+1)?\W?           optional country code, then an optional separator
    # (\(?\d{3}\)?\W?)?    optional area code, with or without parentheses
    # \d{3}[.\- ]\d{4}$    three digits, one separator (".", "-", or " "), four digits
    # \d = any digit
    # \W = any non-alphanumeric
    return bool(re.search(
        r"^(\+1)?\W?(\(?\d{3}\)?\W?)?\d{3}[.\- ]\d{4}$",
        string)), string


In [160]:
capture_phone_numbers('+1 210.867.5309')

(True, '+1 210.867.5309')

In [161]:
capture_phone_numbers('(210) 867 5309')

(True, '(210) 867 5309')

In [162]:
capture_phone_numbers('867-5309')

(True, '867-5309')
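
The fourth number from the exercise, `'210-867-5309'`, is not checked above. A sketch that tests all four in one pass, with the pattern written using `re.VERBOSE` so each part can carry a comment. This is one possible pattern rather than the only one, and `PHONE_REGEX` is a name chosen here.

```python
import re

PHONE_REGEX = re.compile(r"""
    (?:\+1\W?)?           # optional country code and separator
    (?:\(?\d{3}\)?\W?)?   # optional area code, with or without parentheses
    \d{3}                 # first three digits
    [.\- ]                # one separator: dot, dash, or space
    \d{4}                 # last four digits
""", re.VERBOSE)

phone_numbers = ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']

# fullmatch requires the entire string to be a phone number
results = {number: bool(PHONE_REGEX.fullmatch(number)) for number in phone_numbers}
print(results)

# a string with no separator before the last four digits is rejected
print(bool(PHONE_REGEX.fullmatch('8675309')))  # False
```

Whitespace inside a character class is kept even in verbose mode, which is why the space in `[.\- ]` still counts as a separator.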

### 4. Use regular expressions to convert the dates below to the standardized year-month-day format.

In [164]:
dates = ['02/04/19', '02/05/19', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19']

In [174]:
# group 0 is the whole match; capture groups are numbered from 1 (result[1], result[2], ...)
# result.groups() returns the captured groups as an ordinary 0-indexed tuple
def convert_to_YYMMDD(dates):
    # input is assumed to be month/day/year; capture each part and reorder as year-month-day
    reformatted_dates = []
    for date in dates:
        result = re.search(r"^(\d{2})/(\d{2})/(\d{2})$", date)
        month, day, year = result.groups()
        reformatted_dates.append(f'{year}-{month}-{day}')
    return reformatted_dates

In [175]:
convert_to_YYMMDD(dates)

['19-02-04',
 '19-02-05',
 '19-02-06',
 '19-02-07',
 '19-02-08',
 '19-02-09',
 '19-02-10']
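
The same reordering can also be sketched with `re.sub` and backreferences, which avoids the explicit loop over match objects. This assumes the input dates are month/day/year; the variable names are illustrative.

```python
import re

sample_dates = ['02/04/19', '02/05/19']

# \1, \2, \3 in the replacement refer to the first, second, and third groups
# assumption: input is month/day/year, output is year-month-day
reordered = [re.sub(r'^(\d{2})/(\d{2})/(\d{2})$', r'\3-\1-\2', d) for d in sample_dates]
print(reordered)  # ['19-02-04', '19-02-05']

# group numbering on a match object
match = re.search(r'^(\d{2})/(\d{2})/(\d{2})$', '02/04/19')
print(match[0])        # 02/04/19 (group 0 is the whole match)
print(match[1])        # 02 (first capture group)
print(match.groups())  # ('02', '04', '19')
```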

### 5. Write a regex to extract the various parts of these logfile lines:

In [176]:
logs = ['GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58']


In [183]:
# one named group per field; the quoted user agent can contain spaces,
# so splitting on " " alone puts the wrong text in 'headers' and 'ip'
LOG_REGEX = re.compile(
    r'^(?P<method>\S+) (?P<path>\S+) (?P<access_DTG>\[[^\]]+\]) '
    r'(?P<protocol>\S+) (?P<status>\{\d+\}) (?P<bytes_transferred>\d+) '
    r'(?P<headers>"[^"]*") (?P<ip>\S+)$'
)

def get_logfile_parts(logs):
    logfile_dict_list = []
    for string in logs:
        match = LOG_REGEX.search(string)
        if match:
            # groupdict() maps each group name to the text it captured
            logfile_dict_list.append(match.groupdict())
    return logfile_dict_list

In [184]:
get_logfile_parts(logs)

[{'method': 'GET',
  'path': '/api/v1/sales?page=86',
  'access_DTG': '[16/Apr/2019:193452+0000]',
  'protocol': 'HTTP/1.1',
  'status': '{200}',
  'bytes_transferred': '510348',
  'headers': '"python-requests/2.21.0"',
  'ip': '97.105.19.58'},
 {'method': 'POST',
  'path': '/users_accounts/file-upload',
  'access_DTG': '[16/Apr/2019:193452+0000]',
  'protocol': 'HTTP/1.1',
  'status': '{201}',
  'bytes_transferred': '42',
  'headers': '"User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"',
  'ip': '97.105.19.58'},
 {'method': 'GET',
  'path': '/api/v1/items?page=3',
  'access_DTG': '[16/Apr/2019:193453+0000]',
  'protocol': 'HTTP/1.1',
  'status': '{429}',
  'bytes_transferred': '3561',
  'headers': '"python-requests/2.21.0"',
  'ip': '97.105.19.58'}]

In [185]:
import pandas as pd

cool_df = pd.DataFrame(get_logfile_parts(logs))
cool_df

Unnamed: 0,method,path,access_DTG,protocol,status,bytes_transferred,headers,ip
0,GET,/api/v1/sales?page=86,[16/Apr/2019:193452+0000],HTTP/1.1,{200},510348,"""python-requests/2.21.0""",97.105.19.58
1,POST,/users_accounts/file-upload,[16/Apr/2019:193452+0000],HTTP/1.1,{201},42,"""User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36""",97.105.19.58
2,GET,/api/v1/items?page=3,[16/Apr/2019:193453+0000],HTTP/1.1,{429},3561,"""python-requests/2.21.0""",97.105.19.58
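
The parsed values are still strings that carry their delimiters (`[...]`, `{...}`). A possible follow-up step, sketched under two assumptions: the helper name `clean_row` is invented here, and the timestamp is read as day/month/year followed by HHMMSS and a UTC offset.

```python
import re
from datetime import datetime

# a few fields from the first parsed row shown above
row = {
    'access_DTG': '[16/Apr/2019:193452+0000]',
    'status': '{200}',
    'bytes_transferred': '510348',
}

def clean_row(row):
    """Return a copy of row with delimiters removed and values converted."""
    cleaned = dict(row)
    # drop the curly braces, then convert the status code to an int
    cleaned['status'] = int(re.sub(r'[{}]', '', row['status']))
    cleaned['bytes_transferred'] = int(row['bytes_transferred'])
    # drop the square brackets, then parse the timestamp (assumed format)
    stamp = re.sub(r'[\[\]]', '', row['access_DTG'])
    cleaned['access_DTG'] = datetime.strptime(stamp, '%d/%b/%Y:%H%M%S%z')
    return cleaned

cleaned = clean_row(row)
print(cleaned['status'], cleaned['bytes_transferred'])  # 200 510348
print(cleaned['access_DTG'].isoformat())  # 2019-04-16T19:34:52+00:00
```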
