# Regular Expression Exercises


### 1.  Write a function named ```is_vowel```. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of ```re.search``` as a boolean value that indicates whether or not the regular expression matches the given string.



In [1]:
import pandas as pd
import re

def is_vowel(string):
    '''
    Takes in a string and uses a regular expression to determine if the passed string is a single vowel.
    '''
    # Anchor with ^ and $ so that only a one-character string can match
    regex = r'^[aeiouAEIOU]$'
    if re.search(regex, string):
        print(f'Letter {string} is a vowel.')
    else:
        print(f'Letter {string} is not a vowel.')
    
    
    

In [2]:
is_vowel('B')

Letter B is not a vowel.


In [3]:
is_vowel('e')

Letter e is a vowel.


In [54]:
# ANOTHER WAY (FROM WALKTHROUGH)
def is_vowel(string):
    """
    returns a boolean value assessing if the passed string is a single vowel
    """
    regex = r'^[aeiou]$'
    return bool(re.search(regex, string.lower()))

is_vowel('B')

False

In [55]:
is_vowel('e')

True
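
Anchoring matters here: without `^` and `$`, a pattern such as `[aeiou]` matches any string that merely contains a vowel. The sketch below is only an illustration. The helper names `contains_vowel` and `is_single_vowel` are hypothetical and not part of the exercise, and `re.fullmatch` is used as one possible alternative to writing the anchors by hand.

```python
import re

def contains_vowel(string):
    # Unanchored search: truthy when a vowel appears anywhere in the string.
    return bool(re.search(r'[aeiou]', string, re.IGNORECASE))

def is_single_vowel(string):
    # fullmatch only succeeds when the pattern consumes the whole string.
    return bool(re.fullmatch(r'[aeiou]', string, re.IGNORECASE))

# re.search returns a match object (truthy) or None (falsy).
print(re.search(r'[aeiou]', 'xyz'))
print(contains_vowel('hello'), is_single_vowel('hello'))
print(contains_vowel('E'), is_single_vowel('E'))
```

Wrapping the result in `bool()` keeps the return value a plain `True` or `False` instead of a match object.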

### 2.  Write a function named ```is_valid_username``` that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the ```_``` character. It should also be no longer than 32 characters. The function should return either ```True``` or ```False``` depending on whether the passed string is a valid username.

```
is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
False
is_valid_username('codeup')
True
is_valid_username('Codeup')
False
is_valid_username('codeup123')
True
is_valid_username('1codeup')
False
```



In [56]:
import re
def is_valid_username(string):
    '''
    Takes in a string and uses a regular expression to determine whether it is a valid username.
    A valid username starts with a lowercase letter and contains only lowercase letters, numbers, or the _ character.
    It should also be no longer than 32 characters. Returns True if valid and False if not.
    '''
    regex = r'^[a-z][a-z0-9_]{,31}$'
    if re.search(regex, string):
        return True
    else:
        return False
        
is_valid_username('codeup123')
    

True


In [57]:
is_valid_username('Codeup')

False


In [58]:
is_valid_username('codeuP123')

False


In [59]:
is_valid_username('codeup_123')

True


In [60]:
is_valid_username('1codeup')

False


In [61]:
is_valid_username('thistestsastringthatislongerthan32letters')

False


In [64]:
# ANOTHER WAY (FROM WALKTHROUGH)
def is_valid_username(string):
    '''
    Takes in a string and uses a regular expression to determine whether it is a valid username.
    A valid username starts with a lowercase letter and contains only lowercase letters, numbers, or the _ character.
    It should also be no longer than 32 characters. Returns True if valid and False if not.
    '''
    regex = r'^[a-z][a-z0-9_]{,31}$'
    return bool(re.search(regex, string))


is_valid_username('codeup')

True
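
One subtlety with `$`: in Python it also matches just before a trailing newline, so a string such as `'codeup\n'` passes the anchored pattern. The sketch below shows a possible stricter variant using `re.fullmatch`. The name `is_valid_username_strict` is an assumption made for this illustration, not part of the exercise or the walkthrough.

```python
import re

# Same character rules as above; fullmatch supplies the anchoring.
USERNAME_RE = re.compile(r'[a-z][a-z0-9_]{0,31}')

def is_valid_username_strict(string):
    # Hypothetical stricter variant: rejects a trailing newline as well.
    return bool(USERNAME_RE.fullmatch(string))

# The examples from the exercise prompt, plus the trailing-newline case.
checks = {
    'a' * 33: False,
    'a' * 32: True,
    'codeup': True,
    'Codeup': False,
    'codeup123': True,
    '1codeup': False,
    'codeup\n': False,
}
for username, expected in checks.items():
    assert is_valid_username_strict(username) == expected

# The $-anchored pattern accepts the trailing newline.
print(bool(re.search(r'^[a-z][a-z0-9_]{,31}$', 'codeup\n')))
```

Whether a trailing newline should count as invalid depends on where the input comes from, so treat this as an option rather than a required change.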

### 3.  Write a regular expression to capture phone numbers. It should match all of the following:

- (210) 867 5309
- +1 210.867.5309
- 867-5309
- 210-867-5309

### REVISED AFTER WALKTHROUGH:

In [68]:
import pandas as pd
import re

# Create a dataframe of the given phone numbers
df = pd.DataFrame()
df['phone_number'] = ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309', '2108675309']
df

|   | phone_number |
|---|---|
| 0 | (210) 867 5309 |
| 1 | +1 210.867.5309 |
| 2 | 867-5309 |
| 3 | 210-867-5309 |
| 4 | 2108675309 |


In [70]:
# re.compile prepares a regular expression for use ahead of time.
# Named capture groups (?P<name>...) extract the country code, area code, exchange code and line number.
# Passing re.VERBOSE makes the regex engine ignore whitespace in the pattern, so it can span several lines.
# The pattern is a raw string (r'''...''') so that backslashes such as \d reach the regex engine unchanged.
phone_regex = re.compile(
r'''^
(?P<country_code>\+\d+)?
\D*?
(?P<area_code>\d{3})?
\D*?
(?P<exchange_code>\d{3})
\D*?
(?P<line_number>\d{4})
$''', re.VERBOSE)

In [71]:
phone_regex

re.compile(r'^\n(?P<country_code>\+\d+)?\n\D*?\n(?P<area_code>\d{3})?\n\D*?\n(?P<exchange_code>\d{3})\n\D*?\n(?P<line_number>\d{4})\n$',
re.UNICODE|re.VERBOSE)

In [73]:
# extract captured group information from numbers
df['phone_number'].str.extract(phone_regex)

|   | country_code | area_code | exchange_code | line_number |
|---|---|---|---|---|
| 0 | NaN | 210 | 867 | 5309 |
| 1 | +1 | 210 | 867 | 5309 |
| 2 | NaN | NaN | 867 | 5309 |
| 3 | NaN | 210 | 867 | 5309 |
| 4 | NaN | 210 | 867 | 5309 |


In [75]:
#concat extracted information to original phone_number dataframe (df)

pd.concat([df, df['phone_number'].str.extract(phone_regex)], axis=1)

|   | phone_number | country_code | area_code | exchange_code | line_number |
|---|---|---|---|---|---|
| 0 | (210) 867 5309 | NaN | 210 | 867 | 5309 |
| 1 | +1 210.867.5309 | +1 | 210 | 867 | 5309 |
| 2 | 867-5309 | NaN | NaN | 867 | 5309 |
| 3 | 210-867-5309 | NaN | 210 | 867 | 5309 |
| 4 | 2108675309 | NaN | 210 | 867 | 5309 |
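
The same pattern can be exercised without pandas. The following is a minimal sketch using only the `re` module. The helper name `parse_phone` is hypothetical, and the indentation inside the pattern is safe only because `re.VERBOSE` ignores it.

```python
import re

phone_regex = re.compile(r'''
    ^
    (?P<country_code>\+\d+)?    # optional, e.g. +1
    \D*?
    (?P<area_code>\d{3})?       # optional three-digit area code
    \D*?
    (?P<exchange_code>\d{3})
    \D*?
    (?P<line_number>\d{4})
    $''', re.VERBOSE)

def parse_phone(number):
    # Return a dict of the named groups, or None when nothing matches.
    match = phone_regex.search(number)
    return match.groupdict() if match else None

for number in ['(210) 867 5309', '+1 210.867.5309', '867-5309', 'not a number']:
    print(number, '->', parse_phone(number))
```

Optional groups that do not participate in the match come back as `None` in `groupdict()`, which is what pandas shows as `NaN` after `str.extract`.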


### 4. Use regular expressions to convert the dates below to the standardized year-month-day format:

        02/04/19, 02/05/19, 02/06/19, 02/07/19, 02/08/19, 02/09/19, 02/10/19

In [77]:
# create list of given dates to loop through
date_list = ['02/04/19', '02/05/19', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19']


# compose regex to capture each group separated by forward slashes
date_reg = r'(\d+)/(\d+)/(\d+)'


#create blank list to hold converted dates
new_list = []

# loop through original date_list
for date in date_list:
    
    # append a date that adds '20' to make the year four digits, reorder the captured pieces and add dashes as necessary
    new_list.append(re.sub(date_reg, r'20\3-\1-\2', date))
    
# view corrected list
new_list

['2019-02-04',
 '2019-02-05',
 '2019-02-06',
 '2019-02-07',
 '2019-02-08',
 '2019-02-09',
 '2019-02-10']

#### Better way (one step, using `re.sub` in a list comprehension)

In [79]:
[re.sub(date_reg, r'20\3-\1-\2', date) for date in date_list]

['2019-02-04',
 '2019-02-05',
 '2019-02-06',
 '2019-02-07',
 '2019-02-08',
 '2019-02-09',
 '2019-02-10']
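
As a cross-check, the regex result can be compared against the standard library's date parsing. This sketch assumes the dates are in month/day/two-digit-year order and uses named groups (`month`, `day`, `year`), which are choices made for this illustration rather than part of the exercise solution.

```python
import re
from datetime import datetime

date_list = ['02/04/19', '02/05/19', '02/06/19', '02/07/19',
             '02/08/19', '02/09/19', '02/10/19']

# Named groups make the replacement template easier to read than \1, \2, \3.
date_reg = r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})'
by_regex = [re.sub(date_reg, r'20\g<year>-\g<month>-\g<day>', date)
            for date in date_list]

# Independent conversion with strptime/strftime for comparison.
by_strptime = [datetime.strptime(date, '%m/%d/%y').strftime('%Y-%m-%d')
               for date in date_list]

assert by_regex == by_strptime
print(by_regex)
```

Prefixing `20` only works for years in this century. `strptime` applies its own pivot for two-digit years, so the two approaches can disagree on older dates.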

### 5. Write a regex to extract the various parts of these logfile lines:


GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58

In [82]:
lines = """
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
"""

#### What does it all mean?

The parts of the first line:
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
- method: GET
- path: /api/v1/sales?page=86
- timestamp: [16/Apr/2019:193452+0000]
- http version: HTTP/1.1
- status code: {200}
- bytes: 510348
- user agent: "python-requests/2.21.0"
- ip: 97.105.19.58

### REVISED BASED ON WALKTHROUGH EXERCISES

In [83]:
regexp = r'''
^
(?P<method>GET|POST)
\s
(?P<path>[/\w\-\?=]+)
\s
\[(?P<timestamp>.+)\]
\s
(?P<http_version>HTTP/\d+\.\d+)
\s
\{(?P<status_code>\d+)\}
\s
(?P<bytes_out>\d+)
\s
"(?P<user_agent>.+)"
\s
(?P<ip>\d+\.\d+\.\d+\.\d+)
$'''

In [84]:
[re.search(regexp, line, re.VERBOSE).groupdict() for line in lines.strip().split('\n')]

[{'method': 'GET',
  'path': '/api/v1/sales?page=86',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '200',
  'bytes_out': '510348',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'},
 {'method': 'POST',
  'path': '/users_accounts/file-upload',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '201',
  'bytes_out': '42',
  'user_agent': 'User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
  'ip': '97.105.19.58'},
 {'method': 'GET',
  'path': '/api/v1/items?page=3',
  'timestamp': '16/Apr/2019:193453+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '429',
  'bytes_out': '3561',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'}]

In [85]:
regex = re.compile(regexp, re.VERBOSE)

df = pd.DataFrame()
df['line'] = lines.strip().split('\n')
df = pd.concat([df, df.line.str.extract(regex)], axis=1)
df

|   | line | method | path | timestamp | http_version | status_code | bytes_out | user_agent | ip |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GET /api/v1/sales?page=86 [16/Apr/2019:193452+... | GET | /api/v1/sales?page=86 | 16/Apr/2019:193452+0000 | HTTP/1.1 | 200 | 510348 | python-requests/2.21.0 | 97.105.19.58 |
| 1 | POST /users_accounts/file-upload [16/Apr/2019:... | POST | /users_accounts/file-upload | 16/Apr/2019:193452+0000 | HTTP/1.1 | 201 | 42 | User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ... | 97.105.19.58 |
| 2 | GET /api/v1/items?page=3 [16/Apr/2019:193453+0... | GET | /api/v1/items?page=3 | 16/Apr/2019:193453+0000 | HTTP/1.1 | 429 | 3561 | python-requests/2.21.0 | 97.105.19.58 |
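
Every captured value comes back as a string. A possible follow-up step, sketched here with only the `re` module, is to scan the whole block with `re.MULTILINE` and convert the numeric fields. The function name `parse_log` and the decision to convert `status_code` and `bytes_out` to `int` are assumptions made for this example.

```python
import re

log_regex = re.compile(r'''
    ^
    (?P<method>GET|POST)
    \s
    (?P<path>[/\w\-\?=]+)
    \s
    \[(?P<timestamp>.+)\]
    \s
    (?P<http_version>HTTP/\d+\.\d+)
    \s
    \{(?P<status_code>\d+)\}
    \s
    (?P<bytes_out>\d+)
    \s
    "(?P<user_agent>.+)"
    \s
    (?P<ip>\d+\.\d+\.\d+\.\d+)
    $''', re.VERBOSE | re.MULTILINE)

sample = '''
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
'''

def parse_log(text):
    # With re.MULTILINE, ^ and $ match at each line boundary,
    # so finditer yields one match per log line.
    records = []
    for match in log_regex.finditer(text):
        record = match.groupdict()
        record['status_code'] = int(record['status_code'])
        record['bytes_out'] = int(record['bytes_out'])
        records.append(record)
    return records

for record in parse_log(sample):
    print(record['method'], record['path'], record['status_code'], record['bytes_out'])
```

Lines that do not fit the pattern are silently skipped by `finditer`, so comparing the number of records with the number of input lines is a cheap sanity check.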


### Bonus Exercise

You can find a list of words on your Mac at /usr/share/dict/words. Use this file to answer the following questions:


- How many words have at least 3 vowels?
- How many words have at least 3 vowels in a row?
- How many words have at least 4 consonants in a row?
- How many words start and end with the same letter?
- How many words start and end with a vowel?
- How many words contain the same letter 3 times in a row?
- What other interesting patterns in words can you find?

### REVISED BASED ON WALKTHROUGH

In [None]:
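# Read the word list into a Series (one word per row) and drop entries that pandas parses as missing values.
# The squeeze= keyword of read_csv was removed in pandas 2.0, so squeeze the one-column DataFrame instead.
words = pd.read_csv('/usr/share/dict/words', header=None).squeeze('columns').dropna()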
words = words.str.lower()
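
The bonus questions map onto patterns fairly directly. The sketch below runs on a small hand-picked sample list instead of `/usr/share/dict/words`, so the counts are for the sample only. It treats `y` as a consonant, and the same-letter pattern needs at least two characters; both are assumptions of this example.

```python
import re

# Hand-picked sample words, standing in for the real word list.
words = ['queue', 'rhythm', 'area', 'banana', 'strength',
         'level', 'idea', 'beautiful', 'brrr']

patterns = {
    'at least 3 vowels': r'[aeiou].*[aeiou].*[aeiou]',
    '3 vowels in a row': r'[aeiou]{3}',
    '4 consonants in a row': r'[b-df-hj-np-tv-z]{4}',
    'start and end with the same letter': r'^(.).*\1$',
    'start and end with a vowel': r'^[aeiou].*[aeiou]$',
    'same letter 3 times in a row': r'(.)\1\1',
}

# Count how many sample words each pattern matches.
counts = {
    label: sum(bool(re.search(pattern, word)) for word in words)
    for label, pattern in patterns.items()
}

for label, count in counts.items():
    print(f'{label}: {count}')
```

With the pandas Series above, the equivalent count for a pattern without capture groups would be `words.str.contains(pattern).sum()`.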