# Regular Expressions
Regular expressions (regex) are a language for specifying text search patterns. They are particularly useful when we have a pattern to look for and a corpus of texts to search through.
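
As a minimal illustration of that idea, the sketch below scans a tiny corpus for date-like strings. The sample sentences and the `date_pattern` name are invented for this example and are not part of the exercises.

```python
import re

# Hypothetical mini-corpus, invented for this illustration
corpus = [
    "The meeting is on 2024-03-15 in room 12.",
    "No dates here.",
    "Deadlines: 2023-12-01 and 2024-01-31.",
]

# Four digits, a dash, two digits, a dash, two digits (an ISO-like date)
date_pattern = r"\b\d{4}-\d{2}-\d{2}\b"

# re.findall returns every non-overlapping match found in each text
dates_per_text = [re.findall(date_pattern, doc) for doc in corpus]
print(dates_per_text)
```

Each inner list holds the matches for one text, so a text with no dates yields an empty list.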

Here are some resources that will be helpful for this practice:
*   [Official Python Documentation](https://docs.python.org/3/library/re.html)
*   [An online regex checker](https://regex101.com/) 
*   [A cheatsheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)

In [1]:
import re

# Exercise 1
Regular expressions are part of our everyday lives. Whenever we fill in a form with an email address and a password, the code that checks whether the email is well-formed and the password is strong enough is often built on... regular expressions!

In this exercise, you need to define two patterns:
1. A pattern to accept valid emails. Take into account that:
  * Allowed characters in the prefix are letters (a-z), numbers, underscores, periods, and dashes.
  * The last portion of the domain must be between two and five characters long, for example: .com, .org, .cc
2. A pattern to accept passwords that can contain capital letters, lower-case letters, digits, and punctuation characters, and that are at least 8 characters long.

In [None]:
emails_to_be_accepted = "ona.degibert@upf.edu, abc.def@mail-archive.com, Teresa324@domain.org, john.smith@example.com, aoc@hotmail.uk"
emails_to_be_rejected = "@gmail.com, john@hotmail, lisa!@yahoo.edu, jordan@jordan.jordan, 1234567!@org.org"

email_pattern = r"\b[\w\._-]+@[\w-]+\.[a-z]{2,5}\b"

# This may help you test your pattern
print("Accepted emails: ", re.findall(email_pattern,emails_to_be_accepted))
print("Emails that should be rejected but are currently accepted: ",re.findall(email_pattern,emails_to_be_rejected))
if len(re.findall(email_pattern,emails_to_be_accepted)) == 5 and len(re.findall(email_pattern,emails_to_be_rejected)) == 0:
  print("Success, your pattern works!")
else:
  print("Try improving your pattern...")


Accepted emails:  ['ona.degibert@upf.edu', 'abc.def@mail-archive.com', 'Teresa324@domain.org', 'john.smith@example.com', 'aoc@hotmail.uk']
Emails that should be rejected but are currently accepted:  []
Success, your pattern works!
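
The test above uses `re.findall`, which searches for matches anywhere inside a longer string. When validating a single form field, one option is `re.fullmatch`, which only succeeds if the entire string matches. The sketch below reuses the pattern from the cell above; the helper name `is_valid_email` is an assumption made for this example.

```python
import re

# Same pattern as in the cell above, repeated so this sketch is self-contained
email_pattern = r"\b[\w\._-]+@[\w-]+\.[a-z]{2,5}\b"

def is_valid_email(candidate):
    """Return True only if the whole string matches the pattern (hypothetical helper)."""
    return re.fullmatch(email_pattern, candidate) is not None

print(is_valid_email("ona.degibert@upf.edu"))                  # a bare address
print(is_valid_email("john@hotmail"))                          # no top-level domain
print(is_valid_email("see ona.degibert@upf.edu for details"))  # address buried in other text
```

The last call shows the difference: `re.findall` would still extract the address from that sentence, while `re.fullmatch` rejects the string as a whole.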


In [37]:
passwords_to_be_accepted = "Password123!3, Appl3.Pi3, KeS78+Tu"
passwords_to_be_rejected = "Sh0r!"

password_pattern =  r"\b[a-zA-Z\d@$!%*#?+&\.]{8,}\b"

# This may help you test your pattern
print("Accepted passwords: ", re.findall(password_pattern,passwords_to_be_accepted))
print("Passwords that should be rejected but are currently accepted: ",re.findall(password_pattern,passwords_to_be_rejected))

if len(re.findall(password_pattern,passwords_to_be_accepted)) == 3 and len(re.findall(password_pattern,passwords_to_be_rejected)) == 0:
  print("Success, your pattern works!")
else:
  print("Try improving your pattern...")

Accepted passwords:  ['Password123!3', 'Appl3.Pi3', 'KeS78+Tu']
Passwords that should be rejected but are currently accepted:  []
Success, your pattern works!
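
The pattern above restricts which characters may appear and enforces the minimum length, but it does not require any particular kind of character to be present. If the requirement is read as "at least one of each kind", one possible approach is to add lookahead assertions. This is a sketch under that assumption; the names `strict_password_pattern` and `is_strong_password` are invented here, and the punctuation set simply mirrors the character class used above.

```python
import re

# Each (?=...) lookahead checks a condition without consuming characters
strict_password_pattern = (
    r"(?=.*[a-z])"                # at least one lower-case letter
    r"(?=.*[A-Z])"                # at least one capital letter
    r"(?=.*\d)"                   # at least one digit
    r"(?=.*[@$!%*#?+&.])"         # at least one punctuation character
    r"[a-zA-Z\d@$!%*#?+&.]{8,}"   # only allowed characters, 8 or more of them
)

def is_strong_password(candidate):
    """Return True if the whole candidate satisfies every condition (hypothetical helper)."""
    return re.fullmatch(strict_password_pattern, candidate) is not None

for pw in ["Password123!3", "Appl3.Pi3", "KeS78+Tu", "Sh0r!", "password123"]:
    print(pw, is_strong_password(pw))
```

With this variant, `password123` is rejected even though it is long enough, because it has no capital letter and no punctuation character.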


# Exercise 2

We want to perform some text cleaning to be able to count the number of word occurrences in our text.

You should:
* Lowercase the text
* Using regex:
  * Expand contractions (for example, "don't" becomes "do not")
  * Remove punctuation signs
  * Remove digits
  * Replace runs of multiple spaces with a single space

Then, split the text into words and write a function that counts how many times each word occurs and returns the counts as a dictionary.

In [None]:
import collections

text = '''The Catcher in the Rye, J. D. Salinger (1951)    If you really want to hear about it, the first thing you'll probably want to know
        is where I was born, and what my lousy childhood was like, and how my parents were occupied and all before they had me, and all that
        David Copperfield kind of crap, but I don't feel like going into it, if you want to know the truth.'''

# Note here the use of three single quotes to define a multi-line string

def clean_text(text):
  text = text.lower()
  # Expand contractions
  text_cleaned = re.sub(r"n't", " not", text)
  text_cleaned = re.sub(r"'ll", " will", text_cleaned)
  text_cleaned = re.sub(r"\W", " ", text_cleaned) # replace punctuation with spaces
  text_cleaned = re.sub(r"\d", "", text_cleaned) # remove digits
  text_cleaned = re.sub(r"\s+", " ", text_cleaned) # collapse runs of whitespace into a single space
  return text_cleaned.strip() # drop leading/trailing spaces so that splitting yields no empty strings

def count_words(text):
  text_cleaned = clean_text(text) # Note that you can call functions within functions
  words = re.split(r"\W", text_cleaned)
  counts = dict()
  for word in words:
      if word not in counts:
        counts[word] = 1
      else:
        counts[word] = counts[word] + 1
  return counts

count_words(text)

{'about': 1,
 'all': 2,
 'and': 4,
 'before': 1,
 'born': 1,
 'but': 1,
 'catcher': 1,
 'childhood': 1,
 'copperfield': 1,
 'crap': 1,
 'd': 1,
 'david': 1,
 'do': 1,
 'feel': 1,
 'first': 1,
 'going': 1,
 'had': 1,
 'hear': 1,
 'how': 1,
 'i': 2,
 'if': 2,
 'in': 1,
 'into': 1,
 'is': 1,
 'it': 2,
 'j': 1,
 'kind': 1,
 'know': 2,
 'like': 2,
 'lousy': 1,
 'me': 1,
 'my': 2,
 'not': 1,
 'occupied': 1,
 'of': 1,
 'parents': 1,
 'probably': 1,
 'really': 1,
 'rye': 1,
 'salinger': 1,
 'that': 1,
 'the': 4,
 'they': 1,
 'thing': 1,
 'to': 3,
 'truth': 1,
 'want': 3,
 'was': 2,
 'were': 1,
 'what': 1,
 'where': 1,
 'will': 1,
 'you': 3}
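
The notebook imports `collections` without using it. As one possible alternative to the manual counting loop, `collections.Counter` can do the tallying. The sketch below repeats the cleaning steps so that it runs on its own; the function name `count_words_with_counter` and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter

def count_words_with_counter(text):
    """Hypothetical variant of count_words that delegates the tallying to Counter."""
    text = text.lower()
    text = re.sub(r"n't", " not", text)    # expand "don't" -> "do not"
    text = re.sub(r"'ll", " will", text)   # expand "you'll" -> "you will"
    text = re.sub(r"\W", " ", text)        # replace punctuation with spaces
    text = re.sub(r"\d", "", text)         # remove digits
    # str.split() with no argument splits on any run of whitespace
    # and never produces empty strings
    return Counter(text.split())

# Sample sentence invented for this illustration
counts = count_words_with_counter("You'll see: I don't know, you don't know.")
print(counts["you"], counts["know"], counts["will"])
```

`Counter` is a subclass of `dict`, so the result can be used wherever the dictionary from `count_words` would be, and it additionally offers helpers such as `most_common()`.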