## Notes on regex syntax:
- regex101.com
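- this is a character class: []
- "." matches any single character except a newline, loosely similar to "_" in a SQL LIKE pattern
- " ^ " changes meaning depending on where it appears:
    - inside of a character class (as the first character), " ^ " means not in, similar to python "!=". example: r"[^a]" matches any single character that is not "a", so it still finds a match in "apple" (the "p"). It does NOT filter out words that contain "a".
    - outside of a character class, " ^ " means begins with. example: r"^a" would return "apple" if apple was in the list.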
- use regex101's comment tool to generate comments in your code
- for validating e-mail addresses in HTML forms, the HTML spec publishes a reference regex: https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address

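- match objects: re.search and re.match return a match object (or None when nothing matches); inspect it with .group(), .groups(), and .span()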
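
A quick sketch to check how these pieces behave in Python. The word list is made up for illustration, and the variable names are hypothetical.

```python
import re

words = ["apple", "banana", "cherry"]  # hypothetical list for illustration

# Outside a character class, ^ anchors the pattern to the start of the string.
starts_with_a = [w for w in words if re.search(r"^a", w)]

# Inside a character class, ^ negates the class: [^a] matches any one
# character that is not "a", so it still finds a match in "apple" (the "p").
first_non_a = re.search(r"[^a]", "apple")

# To keep only words with no "a" at all, every character must be a non-"a".
no_a_at_all = [w for w in words if re.fullmatch(r"[^a]+", w)]

# "." matches any single character except a newline (unless re.DOTALL is set).
dot_matches = re.findall(r".", "a\nb")

# re.search returns a match object, or None when nothing matches.
m = re.search(r"(b)(c)", "abcd")

print(starts_with_a)                     # ['apple']
print(first_non_a.group())               # p
print(no_a_at_all)                       # ['cherry']
print(dot_matches)                       # ['a', 'b']
print(m.group(0), m.group(1), m.span())  # bc b (1, 3)
```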

In [3]:
import re

In [4]:
re.findall(r'bc', 'abcd')

['bc']

In [5]:
def show_all_matches(regexes, subject, re_length=6):
    print('Sentence:')
    print()
    print('      {}'.format(subject))
    print()
    print('regexp{} | matches'.format(' ' * (re_length - 6)))
    print(' ------{} | ------'.format(' ' * (re_length - 6)))
    for regexp in regexes:
        fmt = ' {:<%d} | {!r}' % re_length
        matches = re.findall(regexp, subject)
        if len(matches) > 8:
            matches = matches[:8] + ['...']
        print(fmt.format(regexp, matches))

In [6]:
sentence = 'Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.'

show_all_matches([
    r'a',
    r'm',
    r'M',
    r'Mary',
    r'little',
    r'1',
    r'10',
    r'22'
], sentence)


Sentence:

      Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

regexp | matches
 ------ | ------
 a      | ['a', 'a', 'a', 'a', 'a']
 m      | ['m', 'm']
 M      | ['M']
 Mary   | ['Mary']
 little | ['little', 'little']
 1      | ['1', '1', '1']
 10     | ['10']
 22     | ['22']


| metacharacter | matches |
| --- | --- |
| `.` | any single character except a newline |
| `\w` | any letter, digit, or underscore |
| `\W` | anything that's not a letter, digit, or underscore |
| `\d` | any digit |
| `\D` | anything that's not a digit |
| `\s` | any whitespace character |

In [7]:
res = [
    r'\w',
    r'\d',
    r'\s',
    r'.', # matches any character except a newline
    r'\.', # a literal period
]
show_all_matches(res, sentence)


Sentence:

      Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

regexp | matches
 ------ | ------
 \w     | ['M', 'a', 'r', 'y', 'h', 'a', 'd', 'a', '...']
 \d     | ['1', '1', '0', '1', '2', '2', '2']
 \s     | [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '...']
 .      | ['M', 'a', 'r', 'y', ' ', 'h', 'a', 'd', '...']
 \.     | ['.', '.', '.']
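
The run above only exercises the lowercase classes. As a small sketch, the uppercase (negated) classes and the underscore behavior of `\w` can be checked the same way. The sample string is made up for illustration.

```python
import re

text = "user_1 paid $20"  # hypothetical sample string

word_chars = re.findall(r"\w", text)   # letters, digits, and "_"
non_word = re.findall(r"\W", text)     # everything \w does not match
non_digits = re.findall(r"\D+", text)  # runs of non-digit characters
non_space = re.findall(r"\S+", text)   # runs of non-whitespace characters

print("_" in word_chars)  # True
print(non_word)           # [' ', ' ', '$']
print(non_digits)         # ['user_', ' paid $']
print(non_space)          # ['user_1', 'paid', '$20']
```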


## Exercises

### 1. Write a function named is_vowel. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.

In [10]:
def is_vowel(word):
    vowels = []
    vowels_bool = []
    for letter in word:
        if re.search(r"^[aeiouAEIOU]$",letter):
            vowels.append(letter)
            vowels_bool.append(True)
    return vowels, vowels_bool

In [11]:
is_vowel('banana')

(['a', 'a', 'a'], [True, True, True])
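
The exercise asks for a single True/False answer, while the function above returns the vowels it finds in a word. A minimal sketch of a boolean version follows. The name `is_single_vowel` is hypothetical, chosen so it does not clash with the function above.

```python
import re

def is_single_vowel(string):
    """Return True only when the whole string is exactly one vowel."""
    # fullmatch requires the entire string to match, so no ^ or $ is needed;
    # re.IGNORECASE covers both upper- and lowercase vowels.
    return bool(re.fullmatch(r"[aeiou]", string, re.IGNORECASE))

print(is_single_vowel("a"))       # True
print(is_single_vowel("E"))       # True
print(is_single_vowel("b"))       # False
print(is_single_vowel("banana"))  # False
```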

### 2. Write a function named is_valid_username that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character. It should also be no longer than 32 characters. The function should return either True or False depending on whether the passed string is a valid username.

In [13]:
def is_valid_username(string):
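    # ^ anchors the match at the start of the string
    # [a-z] requires the first character to be a lowercase letter
    # \w = [0-9a-zA-Z_] would also allow uppercase letters, so it is not used here
    # [a-z_0-9]{0,31} allows up to 31 more lowercase letters, digits, or underscores (32 total)
    # $ anchors the match at the end of the string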
    return bool(re.search(r"^[a-z][a-z_0-9]{0,31}$", string))

In [14]:
a = 'a' * 35
kyle = 'sensei_36'

In [15]:
is_valid_username(a)

False

In [16]:
is_valid_username('fred_lindsey_92')

True

In [17]:
is_valid_username(kyle)

True

In [19]:
assert is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')==False
assert is_valid_username('codeup') == True
assert is_valid_username('Codeup') == False
assert is_valid_username('codeup123') == True
assert is_valid_username('1codeup') == False
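
One edge case worth knowing: in Python, `$` also matches just before a trailing newline, so the pattern above accepts `'codeup\n'`. A sketch of a stricter variant using `re.fullmatch` is below. The names `USERNAME_RE` and `is_valid_username_strict` are hypothetical.

```python
import re

# Same character rules as above; fullmatch makes the anchors unnecessary.
USERNAME_RE = re.compile(r"[a-z][a-z0-9_]{0,31}")

def is_valid_username_strict(string):
    # fullmatch only succeeds when the entire string matches the pattern
    return USERNAME_RE.fullmatch(string) is not None

# The search-based pattern lets a trailing newline through...
print(bool(re.search(r"^[a-z][a-z_0-9]{0,31}$", "codeup\n")))  # True
# ...while fullmatch rejects it.
print(is_valid_username_strict("codeup\n"))  # False
print(is_valid_username_strict("a" * 32))    # True
print(is_valid_username_strict("a" * 33))    # False
```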

### 3. Write a regular expression to capture phone numbers. It should match all of the following:

- phone_numbers = '(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309'

In [159]:

def capture_phone_numbers(string):
    r"""
    Expression: (\+1)?\W?(\(?[\d]{3}?\)?\W?)?[\d]{3}[\.\- ][\d]{4}$

    Explanation adapted from regex101:

    1st capturing group: (\+1)?
        \+   matches the character + literally (code 43 decimal, 2B hex, 53 octal)
        1    matches the character 1 literally (code 49 decimal, 31 hex, 61 octal)
        ?    matches the previous token between zero and one times (greedy)

    \W?  optionally matches one non-word character (equivalent to [^a-zA-Z0-9_])

    2nd capturing group: (\(?[\d]{3}?\)?\W?)?
        \(?      optionally matches the character ( literally (code 40 decimal, 28 hex, 50 octal)
        [\d]{3}? matches a digit (equivalent to [0-9]) exactly 3 times
        \)?      optionally matches the character ) literally (code 41 decimal, 29 hex, 51 octal)
        \W?      optionally matches one non-word character
        ?        the whole group matches between zero and one times (greedy)

    [\d]{3}  matches a digit exactly 3 times
    [\.\- ]  matches a single character from the list:
        \.   the character . (code 46 decimal, 2E hex, 56 octal)
        \-   the character - (code 45 decimal, 2D hex, 55 octal)
             a space (code 32 decimal, 20 hex, 40 octal)
    [\d]{4}  matches a digit exactly 4 times
    $        asserts position at the end of the string

    regex101 flags used while testing (not passed to re.search below):
        g modifier: global. All matches (don't return after first match)
        m modifier: multi line. Causes ^ and $ to match the begin/end of each line
    """
    # may start with a digit, "+", or "("
    # tolerates ".", "-", or " " as separators between the digit groups
    # note: the pattern itself does not enforce an overall length limit
    # \w = any letter, digit, or underscore
    # \W = anything that is not a letter, digit, or underscore
    return bool(re.search(
        r"(\+1)?\W?(\(?[\d]{3}?\)?\W?)?[\d]{3}[\.\- ][\d]{4}$", 
        string)), string
            


In [160]:
capture_phone_numbers('+1 210.867.5309')

(True, '+1 210.867.5309')

In [161]:
capture_phone_numbers('(210) 867 5309')

(True, '(210) 867 5309')

In [162]:
capture_phone_numbers('867-5309')

(True, '867-5309')
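
As a sketch, the same idea can be written with `re.VERBOSE` so the comments live next to the parts they describe, and checked against all four numbers from the exercise. The pattern below is an equivalent, slightly simplified spelling of the one above (`\d{3}` in place of `[\d]{3}?`), and the name `PHONE_RE` is hypothetical.

```python
import re

# In verbose mode, whitespace in the pattern is ignored and "#" starts a
# comment, but whitespace inside a character class is still significant,
# so the separator class below can contain a literal space.
PHONE_RE = re.compile(r"""
    (\+1)?              # optional country code
    \W?                 # optional separator
    (\(?\d{3}\)?\W?)?   # optional area code, with optional parentheses
    \d{3}               # first three digits of the local number
    [. -]               # one separator: period, space, or hyphen
    \d{4}$              # last four digits, then end of string
""", re.VERBOSE)

phone_numbers = ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']
results = {number: bool(PHONE_RE.search(number)) for number in phone_numbers}

for number, matched in results.items():
    print(f'{number!r}: {matched}')
```

All four numbers from the exercise should report `True`, while a string that is one digit short, such as `'867-530'`, does not match.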

### 4. Use regular expressions to convert the dates below to the standardized year-month-day format.

In [25]:
dates = ['02/04/19', '02/05/19', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19']

In [26]:
# Group 0 is the whole match; capture groups are numbered from 1.
# (.groups() returns the capture groups as a regular 0-indexed tuple.)
def convert_to_YYMMDD(dates):
    # capture the three fields between the slashes and collect the reordered dates in a list
    reformatted_dates = []
    for date in dates:
        result = re.search(r"^(\d{2})/(\d{2})/(\d{2})$", date)
        reformatted_dates.append(f'{result[3]}/{result[2]}/{result[1]}')
    return reformatted_dates

In [29]:
#another solution:
import pandas as pd

dates_new = pd.Series(dates)
dates_new

0    02/04/19
1    02/05/19
2    02/06/19
3    02/07/19
4    02/08/19
5    02/09/19
6    02/10/19
dtype: object

In [33]:
dates_new = dates_new.str.replace(r"(\d{2})/(\d{2})/(\d{2})", r"20\3-\2-\1", regex=True)
dates_new

0    2019-04-02
1    2019-05-02
2    2019-06-02
3    2019-07-02
4    2019-08-02
5    2019-09-02
6    2019-10-02
dtype: object

In [34]:
convert_to_YYMMDD(dates)

['19/04/02',
 '19/05/02',
 '19/06/02',
 '19/07/02',
 '19/08/02',
 '19/09/02',
 '19/10/02']
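
If the input dates are month/day/year (an assumption, though consecutive days in February 2019 would fit the list), both solutions above put the day before the month. A sketch of a year-month-day conversion with `re.sub` follows. The name `to_iso_date` is hypothetical, and the century prefix `20` is assumed, as in the pandas solution above.

```python
import re

dates = ['02/04/19', '02/05/19', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19']

def to_iso_date(date):
    # Assumes MM/DD/YY input and a year in the 2000s.
    # \1 = month, \2 = day, \3 = two-digit year
    return re.sub(r"^(\d{2})/(\d{2})/(\d{2})$", r"20\3-\1-\2", date)

iso_dates = [to_iso_date(d) for d in dates]
print(iso_dates[0])   # 2019-02-04
print(iso_dates[-1])  # 2019-02-10
```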

### 5. Write a regex to extract the various parts of these logfile lines:

In [35]:
logs = ['GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58']


In [36]:
# use split
def get_logfile_parts(logs):
    # create a list to hold one dictionary per log line
    logfile_dict_list = []
    # loop over the log lines
    for string in logs:
        # split each line on single spaces
        # note: a field that contains spaces (such as a long User-Agent) gets cut at its first space
        split = re.split(r" ", string)
        # map the split pieces to named fields by position
        logfile_dict = {'method': split[0], 'path': split[1], 'access_DTG': split[2],
                        'protocol': split[3], 'status': split[4], 'bytes_transferred': split[5],
                        'headers': split[6], 'ip': split[7]}
        # collect each dictionary in the master list
        logfile_dict_list.append(logfile_dict)
    # return the master list once the loop is done
    return logfile_dict_list

In [37]:
get_logfile_parts(logs)

[{'method': 'GET',
  'path': '/api/v1/sales?page=86',
  'access_DTG': '[16/Apr/2019:193452+0000]',
  'protocol': 'HTTP/1.1',
  'status': '{200}',
  'bytes_transferred': '510348',
  'headers': '"python-requests/2.21.0"',
  'ip': '97.105.19.58'},
 {'method': 'POST',
  'path': '/users_accounts/file-upload',
  'access_DTG': '[16/Apr/2019:193452+0000]',
  'protocol': 'HTTP/1.1',
  'status': '{201}',
  'bytes_transferred': '42',
  'headers': '"User-Agent:',
  'ip': 'Mozilla/5.0'},
 {'method': 'GET',
  'path': '/api/v1/items?page=3',
  'access_DTG': '[16/Apr/2019:193453+0000]',
  'protocol': 'HTTP/1.1',
  'status': '{429}',
  'bytes_transferred': '3561',
  'headers': '"python-requests/2.21.0"',
  'ip': '97.105.19.58'}]

In [38]:
cool_df = pd.DataFrame(get_logfile_parts(logs))
cool_df

|   | method | path | access_DTG | protocol | status | bytes_transferred | headers | ip |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | GET | /api/v1/sales?page=86 | [16/Apr/2019:193452+0000] | HTTP/1.1 | {200} | 510348 | "python-requests/2.21.0" | 97.105.19.58 |
| 1 | POST | /users_accounts/file-upload | [16/Apr/2019:193452+0000] | HTTP/1.1 | {201} | 42 | "User-Agent: | Mozilla/5.0 |
| 2 | GET | /api/v1/items?page=3 | [16/Apr/2019:193453+0000] | HTTP/1.1 | {429} | 3561 | "python-requests/2.21.0" | 97.105.19.58 |


In [None]:
#another way

#convert the list of strings to a series >> pd.Series(logs)

#create logfile_re as a multiline raw string with labeled capture groups
#logfile_re = r"""
# REGEX statement, label each capture group with (?P<name>...), e.g. (?P<http_version>...)
# """

# then...

#series.str.extract(logfile_re, flags=re.VERBOSE) will turn the result into a dataframe
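
A minimal sketch of the named-group approach outlined in the notes above, using only the standard library so it runs without pandas. The group names and the exact pattern are illustrative assumptions, not a definitive log format. Unlike splitting on spaces, the quoted header field is captured whole.

```python
import re

logs = ['GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58']

# Hypothetical pattern: one named group per field, written in verbose mode.
LOGFILE_PATTERN = r'''
    ^(?P<method>[A-Z]+)\s+
    (?P<path>\S+)\s+
    \[(?P<timestamp>[^\]]+)\]\s+
    (?P<http_version>\S+)\s+
    \{(?P<status>\d+)\}\s+
    (?P<bytes_sent>\d+)\s+
    "(?P<user_agent>[^"]*)"\s+
    (?P<ip>\d+(?:\.\d+){3})$
'''
LOGFILE_RE = re.compile(LOGFILE_PATTERN, re.VERBOSE)

# groupdict() turns each match into a {group name: captured text} dictionary.
parsed = [LOGFILE_RE.search(line).groupdict() for line in logs]

print(parsed[1]['ip'])          # 97.105.19.58
print(parsed[1]['user_agent'])  # the full quoted header, spaces included
```

The same `LOGFILE_PATTERN` string could be passed to pandas as the notes above describe, via `pd.Series(logs).str.extract(LOGFILE_PATTERN, flags=re.VERBOSE)`, which returns one DataFrame column per named group.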