# Exercises on regexes
Relevant documentation: https://docs.python.org/3.5/library/re.html, https://docs.python.org/3.5/howto/regex.html#regex-howto

The exercises are a mix of pre-written code that is ready to be used "as is" and areas where you need to fill in missing code, denoted by "CODE HERE". As you progress through the exercises, you will notice that there is less and less pre-written code, meaning you will have to write more code yourself.

In [None]:
import re
from html.parser import HTMLParser
from datetime import date
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import tostring

In [None]:
def text_match(text, pattern):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

def evaluate_pattern(text_list, pattern, should_match=True):
    """Print, for each text, whether the pattern behaves as expected."""
    for text in text_list:
        match = text_match(text, pattern)
        if not match and should_match:
            print(f"failed to match pattern {pattern} to text {text}")
        elif match and should_match:
            print(f"successfully matched pattern {pattern} to text {text}")
        elif match and not should_match:
            print(f"incorrectly matched pattern {pattern} to text {text}")
        elif not match and not should_match:
            print(f"correctly did not match pattern {pattern} to text {text}")

# Part 1 - regex basics
These first exercises aim to train you in the basic usage of regexes. For these exercises the answer can be found almost directly in the course slides or in the documentation linked above. The idea is that you acquire an entry-level proficiency with the operators used in regexes.
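
As a warm-up, the quantifiers `*`, `+` and `{m,n}` can be tried out on digits rather than letters. This is a minimal sketch and not a solution to any exercise below; the `has_match` helper and the sample strings are made up for illustration.

```python
import re

def has_match(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# "x" followed by zero or more digits: a bare "x" is enough.
zero_or_more = has_match(r"x\d*", "x")

# "x" followed by one or more digits: a bare "x" is not enough.
one_or_more = has_match(r"x\d+", "x")

# "x", then two to three digits, then "y".
bounded = has_match(r"x\d{2,3}y", "x123y")
too_many = has_match(r"x\d{2,3}y", "x1234y")

print(zero_or_more, one_or_more, bounded, too_many)  # True False True False
```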

## Exercise 1  
Complete the provided Python program that matches a string that has an a followed by zero or more b's.

In [None]:
pattern = CODE HERE

In [None]:
should_match = ["ac","abc","abbc","a","ac"]
should_not_match = ["bc"]

evaluate_pattern(should_match, pattern, True)
evaluate_pattern(should_not_match, pattern, False)

## Exercise 2
Complete the provided Python program that matches a string that has an a followed by one or more b's.

In [None]:
pattern = CODE HERE

In [None]:
should_match = ["abc","abbc"]
should_not_match = ["bc","ac","a"]

evaluate_pattern(should_match, pattern, True)
evaluate_pattern(should_not_match, pattern, False)

## Exercise 3
Complete the provided Python program that matches a string that has an a followed by at least 3 b's.

In [None]:
pattern = CODE HERE

In [None]:
should_match = ["aabbbc"]
should_not_match = ["bc","ac","abc","abbc"]

evaluate_pattern(should_match, pattern, True)
evaluate_pattern(should_not_match, pattern, False)

## Exercise 4
Complete the provided Python program that matches a string that has an a followed by 2 to 4 b's, followed by 1 or more c's. 

In [None]:
CODE HERE

In [None]:
should_match = ["abbc","aabbbc","aabbbcc"]
should_not_match = ["bc","ac","abc","aabbbbbbbc"]

evaluate_pattern(should_match, pattern, True)
evaluate_pattern(should_not_match, pattern, False)

## Exercise 5
Complete the provided Python program that matches a string that has an a, followed by anything, and ends in a b.

In [None]:
CODE HERE

In [None]:
should_match = ["accddbbjjjb","dfracccccdjjjb"]
should_not_match = ["aabbbbd","aabAbbbc"]

evaluate_pattern(should_match, pattern, True)
evaluate_pattern(should_not_match, pattern, False)

## Exercise 6
Complete the provided Python program that matches a string that starts with an a, followed by anything, and ends in a b.

In [None]:
CODE HERE

In [None]:
should_match = ["accddbbjjjb"]
should_not_match = ["aabbbbd","aabAbbbc","dfracccccdjjjb"]

evaluate_pattern(should_match, pattern, True)
evaluate_pattern(should_not_match, pattern, False)

## Exercise 7
Complete the provided Python program below to split a string with multiple delimiters. A delimiter is a character or sequence of characters that denotes the boundary between separate elements in plain text or a data stream. An example is the comma in the CSV format (comma-separated values). For this exercise, possible delimiters are ; \S  *  $ <br>
Note that some of these delimiters have a special meaning in a regex, so you must escape them with a backslash \
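
To illustrate the escaping rule on a different set of delimiters, here is a minimal sketch using `re.split`. The sample text and the delimiters `|`, `+` and `.` are assumptions chosen for the example, not the ones from this exercise.

```python
import re

# Hypothetical sample text with three delimiters: "|", "+" and "."
sample = "alpha|beta+gamma.delta"

# "|", "+" and "." are regex metacharacters, so each one is escaped with a backslash.
parts = re.split(r"\||\+|\.", sample)
print(parts)  # ['alpha', 'beta', 'gamma', 'delta']

# re.escape produces the escaped form automatically.
delimiters = ["|", "+", "."]
auto_pattern = "|".join(re.escape(d) for d in delimiters)
auto_parts = re.split(auto_pattern, sample)
print(auto_parts)  # ['alpha', 'beta', 'gamma', 'delta']
```

Writing the pattern as a raw string (the `r` prefix) keeps Python from interpreting the backslashes before the regex engine sees them.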

In [None]:
text = r'The quick brown\Sfox jumps*over the lazy dog;huzzah.'  # raw string: \S stays a literal backslash followed by S
CODE HERE

## Exercise 8
Complete the provided Python program below to filter out and print the numbers (i.e. numeric characters, grouped as they are) from the given string.
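
For reference, `re.findall` returns every non-overlapping match as a list of strings. The sketch below uses it on an unrelated task (collecting runs of uppercase letters); the sentence and the pattern are made up for illustration.

```python
import re

# Hypothetical sample sentence.
sentence = "NASA and ESA launched JWST in 2021"

# Collect every run of two or more uppercase letters.
acronyms = re.findall(r"[A-Z]{2,}", sentence)

for acronym in acronyms:
    print(acronym)
# NASA
# ESA
# JWST
```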

In [None]:
# Sample string.
text = "Ten 10, Twenty 20, Thirty 30"
result = CODE HERE
# Print results.
for element in result:
    print(element)

## Exercise 9
Complete the provided Python program below that matches the word "Isaac " only if followed by a number, and not if followed by anything else.

In [None]:
CODE HERE

In [None]:
should_match = ["Isaac 5","Isaac 259"]
should_not_match = ["Isaac Lastname","Isaac9"]

evaluate_pattern(should_match, pattern, True)
evaluate_pattern(should_not_match, pattern, False)

# Part 2 - use cases
These exercises will guide you through some of the real-world use cases of regexes.

## Exercise 10
Complete the provided Python program to remove leading zeroes from an IP address.

In [None]:
ip = "216.08.094.196"

CODE HERE

assert string == "216.8.94.196"

print(string)

## Exercise 11
Complete the provided Python program to extract the year, month and day from the provided url and convert them to a Python date object. The date in the url is formatted as yyyy/mm/dd or yyyy/m/d.

In [None]:
url1= "https://www.washingtonpost.com/news/football-insider/wp/2016/12/24/odell-beckhams-fame-rests-on-one-stupid-little-ball-josh-norman-tells-author/"
url2= "https://www.washingtonpost.com/news/football-insider/wp/2016/9/2/odell-beckhams-fame-rests-on-one-stupid-little-ball-josh-norman-tells-author/"

CODE HERE

def extract_date(url, pattern):
    match_obj = re.search(pattern, url)
    if match_obj is not None:
        date_object = date(*[int(i) for i in match_obj.groups()])
        return date_object

assert extract_date(url1, pattern) == date(2016,12,24)
assert extract_date(url2, pattern) == date(2016,9,2)

## Exercise 12
Complete the provided Python program below to convert a date from the yyyy-mm-dd format to dd-mm-yyyy
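
Reordering parts of a string is a typical job for `re.sub` with capturing groups and backreferences. The following is a minimal sketch on a different input (swapping "last, first" names); the `swap_name` function and the sample name are hypothetical.

```python
import re

def swap_name(name):
    """Turn 'Last, First' into 'First Last' using backreferences."""
    # Group 1 captures the last name, group 2 the first name;
    # \2 and \1 in the replacement refer back to those groups.
    return re.sub(r"(\w+), (\w+)", r"\2 \1", name)

swapped = swap_name("Lovelace, Ada")
print(swapped)  # Ada Lovelace
```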

In [None]:
def change_date_format(dt):
    return CODE HERE
dt1 = "2026-01-02"
print("Original date in YYYY-MM-DD Format: ", dt1)
print("New date in DD-MM-YYYY Format: ", change_date_format(dt1))

## Exercise 13
Complete the provided Python program below to check that a string contains only a certain set of characters, in this case a-z, A-Z and 0-9. The program should return "True" if the checked string contains only allowed characters, and "False" if the string contains any other character.

In [None]:
def is_allowed_specific_char(string):
    CODE HERE

Now, using the program above, check the strings "ABCDEFabcdef123450", "*&%@#!}{" and "15@a" against the rule outlined above. The expected results are "True", "False", "False".

In [None]:
assert is_allowed_specific_char("ABCDEFabcdef123450") == True
assert is_allowed_specific_char("*&%@#!}{") == False
assert is_allowed_specific_char("15@a") == False

Tips
- Note that we are using "search" since we have no information on the expected or allowed length of the string.
- As a reminder, "search" scans through the string looking for the first location where the regular expression pattern produces a match.
- Consider using one or more negations.
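
The combination of "search" and a negation can be sketched on a different character set. The example below accepts only digits and hyphens; the function name and test strings are made up for illustration, and this is one possible approach rather than the intended solution.

```python
import re

def only_digits_and_hyphens(text):
    """Return True if text contains nothing but digits and hyphens."""
    # [^0-9-] is a negated character class: it matches any single character
    # that is NOT a digit or a hyphen. Finding one means the text is rejected.
    return re.search(r"[^0-9-]", text) is None

print(only_digits_and_hyphens("555-0100"))  # True
print(only_digits_and_hyphens("555 0100"))  # False
```

Because the pattern looks for a single forbidden character, it works regardless of how long the string is.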

## Exercise 14
We will now revisit the sitemap xml file from vrt.be containing an overview of published articles, which was also used in the first exercise session. Write a Python program to extract all the urls starting with http(s) from the provided file, using a regex.

First the xml file needs to be made ready for use. Tip: think back on the material covered in week 1.

In [None]:
CODE HERE

Now, use an appropriate regex to extract the urls from the xml file. Tip: remember that we want all the urls in the document.

In [None]:
CODE HERE

## Exercise 15
Complete the provided Python program below to check if a given password meets certain criteria. For this exercise, a good password consists of at least 12 but no more than 16 characters, with at least one uppercase letter, at least one lowercase letter, at least one digit and at least one non-alphanumeric character, and it cannot contain whitespace characters. This should make the password hard to crack, and as unlikely as possible to be remembered by the user.<br>
Hint for real life: use two-factor authentication for the apps that allow it and a password manager that autogenerates long, random passwords <br>
Hint for exercise: use lookahead assertions
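
A lookahead assertion `(?=...)` checks that something follows without consuming any characters, so several independent requirements can be stacked at the start of a pattern. The sketch below applies this to a much simpler rule (word characters only, with at least one digit and one lowercase letter); the function name and the rule are assumptions made for illustration.

```python
import re

def has_digit_and_lowercase(text):
    """Return True if text is all word characters with a digit and a lowercase letter."""
    # (?=.*\d)     : somewhere ahead there is a digit
    # (?=.*[a-z])  : somewhere ahead there is a lowercase letter
    # \w+$         : the whole string consists of word characters
    pattern = r"^(?=.*\d)(?=.*[a-z])\w+$"
    return re.search(pattern, text) is not None

print(has_digit_and_lowercase("abc1"))  # True
print(has_digit_and_lowercase("abcd"))  # False, no digit
print(has_digit_and_lowercase("1234"))  # False, no lowercase letter
print(has_digit_and_lowercase("ab 1"))  # False, contains a space
```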

In [None]:
CODE HERE



In [None]:
assert password_check("password") == False
assert password_check("passw0rd") == False
assert password_check("Passw0rd!!") == False
assert password_check("letmeinalready11111!") == False
assert password_check("LET ME IN now12!") == False
assert password_check("LetInAlready52!") == True

print(password_check("password"))
print(password_check("passw0rd"))
print(password_check("Passw0rd!!"))
print(password_check("letmeinalready11111!"))
print(password_check("LET ME IN now12!"))
print(password_check("LetInAlready!5"))