# Chapter 1: Regular Expressions

In [None]:
#We will use the "re" package for regular expressions
import re



## **DIY: Find years**
Write a function that returns a list of the years found in a given input string.
Your function should return ['1789', '2009', '1960', '1346', '1982'] in the following example:

In [None]:
text = "August 23; 1789; April 14, 2009; September 12 1960; March 3, 1346, October 2, 1982"

def find_years(text):
    #Takes a string as input, returns a list of the years found in it
    #Fill this function
    pass  #placeholder so the cell runs; replace it with your code

years_list = find_years(text)
print(years_list)

## **Regular expressions**
A regular expression (regex) defines a search pattern for strings. The search pattern can be anything from a single character or a fixed string to a complex expression containing special characters that describe the pattern. For a given string, the pattern defined by the regex may match once, several times, or not at all.

Regular expressions can be used to search, edit and manipulate text.

The power of regex comes from being able to match complex strings with simple patterns.

## re.findall()
findall returns all non-overlapping matches of a pattern in a string, as a list of strings.
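
As a small illustration of that behaviour, here is a minimal sketch. The sample sentence and the variable names are made up for this example; only `re.findall` itself comes from the `re` module.

```python
import re

# Hypothetical sample sentence, chosen only for this illustration
sentence = "Room 12 is on floor 3, room 405 is on floor 40."

# \d+ means "one or more digits"; the r prefix keeps the backslash intact
numbers = re.findall(r"\d+", sentence)
print(numbers)  # ['12', '3', '405', '40']

# With no match, findall returns an empty list rather than None
nothing = re.findall(r"\d+", "no digits here")
print(nothing)  # []
```

Because the result is always a list, it can be looped over or passed to `len()` without checking for `None` first.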

Let's try to capture all (and only) the years in the following string:

In [None]:
text = "August 23; 1789; April 14, 2009; September 12 1960; March 3, 1346, October 2, 1982"
pattern = r"\d{4}"  #raw string, so that Python leaves the backslash alone

result = re.findall(???) #Fill the code
print(result)



## __Basic regex operations/patterns__

Most characters match themselves. To build patterns, however, some symbols are reserved as **meta-characters**. They do not match themselves because they have special meanings: . * + ? ^ \$ \{ \} [ ] | ( )


<br>
<li>. (dot) -- matches any single character except newline '\n'
<li>* (star) -- causes the preceding regex pattern to match 0 or more repetitions. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
<li>+ (plus) -- causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
<li>? (question mark) -- causes the preceding RE to match 0 or 1 repetitions. ab? will match either ‘a’ or ‘ab’. Placed after another quantifier (*?, +?, ??), it switches that quantifier from greedy to non-greedy: by default, quantifiers match the longest possible string, and the trailing "?" makes them match the shortest possible string instead.
<li>^ (caret) -- matches at the start of the string. It matches a position, not a character, so nothing is added to the result.
<li>\$ (dollar) -- matches at the end of the string (or just before a newline at the very end). Like the caret, it matches a position, not a character.
<br>
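
The meta-characters listed above can be tried out on a few toy strings. This is only a sketch: the strings and variable names are invented for the illustration.

```python
import re

# Toy strings made up for this sketch
dot = re.findall(r"c.t", "cat cot cut ct")          # . needs exactly one character
star = re.findall(r"ab*", "a ab abbb b")            # a followed by zero or more b's
plus = re.findall(r"ab+", "a ab abbb b")            # a followed by at least one b
optional = re.findall(r"colou?r", "color colour")   # the u may or may not be there
anywhere = re.findall(r"ab", "ab ab ab")            # no anchor: every occurrence
at_start = re.findall(r"^ab", "ab ab ab")           # only at the start of the string
at_end = re.findall(r"ab$", "ab ab ab")             # only at the end of the string

print(dot)       # ['cat', 'cot', 'cut']
print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']
print(anywhere)  # ['ab', 'ab', 'ab']
print(at_start)  # ['ab']
print(at_end)    # ['ab']
```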

Python regex cheatsheet: https://www.dataquest.io/blog/regex-cheatsheet/



### **DIY: A's and B's** 
Write a Python program that matches a string that starts with an **a** followed by **(only) zero or more** **b**'s.



In [None]:
import re
def text_match(text):
    pattern = ??? #Fill the code
    if re.findall(pattern, text):
        return 'Found a match!'
    else:
        return 'Not matched!'

print(text_match("ac")) #Not matched!
print(text_match("abc")) #Not matched!
print(text_match("a")) # Found a match!
print(text_match("ab")) # Found a match!
print(text_match("abb")) # Found a match!
print(text_match("cabb")) #Not matched!

### __What if we need to capture the meta-characters in text?__
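A meta-character loses its special meaning when it is preceded by a backslash. The sketch below shows this with dollar signs and dots in a made-up price list (the string and variable names are assumptions for the illustration). `re.escape` is a helper from the `re` module that adds the backslashes for you.

```python
import re

# Hypothetical sample text
prices = "Coffee costs $3.50, tea costs $2.75, water is free."

# Unescaped: $ means "end of string", so digits can never follow it
unescaped = re.findall(r"$\d+.\d+", prices)
print(unescaped)  # []

# Escaped: \$ is a literal dollar sign, \. is a literal dot
escaped = re.findall(r"\$\d+\.\d+", prices)
print(escaped)  # ['$3.50', '$2.75']

# re.escape builds a pattern that matches a given text literally
literal = re.findall(re.escape("$2.75"), prices)
print(literal)  # ['$2.75']
```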

Let's try to capture all the years, including any plus signs that precede them, in the following string:

In [None]:
string = "August 23; +++++1789; April 14, +2009; September 12 1960; March 3, 1346, October 2, 1982"
pattern = ??? #Fill the code

result = ??? #Fill the code
print(result)


### __The Task:__ Analyze Spam Emails
In this tutorial, we’ll use the Fraudulent Email Corpus from Kaggle. It contains thousands of phishing emails sent between 1998 and 2007. 

1. Open the file "fradulent_emails_utf8.txt"
2. Collect the list of all senders: write a regex pattern to find all lines starting with "From:"
3. Collect only the e-mail addresses of the senders
4. Collect only the names of the senders

In [None]:
import re


# The with-block closes the file automatically once it has been read
with open("./fradulent_emails_utf8.txt", "r", encoding="utf-8") as f:
    file_content = f.read()
print(file_content[:300])

In [None]:
# Collect lines starting with "From:"
pattern = ??? #Fill the code
result = re.findall(pattern, file_content)
print(result)


By default, the caret only matches at the start of the whole string. So if you have a multi-line string (for example, the contents of a text file), it will still match only once: at the beginning of the string.

However, you may want to match at the beginning of each line. For example, you may want to find all lines that start with "From:" in a given string.

You can make the caret match at the beginning of each line by passing the re.MULTILINE flag. The previous cell searched without the flag; the next exercise cell sets it.
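
The difference is easy to see on a tiny three-line string. This is only a sketch: the string and the variable names are made up, and the pattern is deliberately simpler than the one needed for the e-mail file.

```python
import re

# Hypothetical three-line string
message = "To be\nTo do\nNot To"

# Without the flag, ^ only matches at the very start of the string
without_flag = re.findall(r"^To", message)
print(without_flag)  # ['To']

# With re.MULTILINE, ^ also matches right after every newline
with_flag = re.findall(r"^To", message, re.MULTILINE)
print(with_flag)  # ['To', 'To']
```

The "To" in the third line is never matched, because it is not at the beginning of a line.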

In [None]:
# Collect lines starting with "From:"
pattern = ??? #Fill the code
result = re.findall(pattern, file_content, re.MULTILINE)
print(result[:10])


In [None]:
#Let's find the e-mail addresses
pattern_email = ??? #Fill the code
#Loop over all lines in result, try to capture an e-mail address and print it

for line in result[:10]:
    email = re.findall(pattern_email, line)
    print(email)


In [None]:
# How can we extract only the names?
# Loop over all lines in result, and use re.findall to find matching patterns and print them
pattern_name = ??? #Fill the code

for line in result[:10]:
    name = re.findall(pattern_name, line)
    print(name)
  


### Greedy vs. non-greedy search
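Quantifiers such as + and * are greedy: they grab as much text as they can while still allowing the overall pattern to match. Adding ? after the quantifier makes it stop as early as possible. A minimal sketch on a made-up sentence (the string and variable names are assumptions for the illustration):

```python
import re

# Hypothetical sample sentence with two quoted words
quote = 'She said "yes" and then "no".'

# Greedy: .+ runs to the LAST closing quote it can find
greedy = re.findall(r'".+"', quote)
print(greedy)  # ['"yes" and then "no"']

# Non-greedy: .+? stops at the FIRST closing quote
lazy = re.findall(r'".+?"', quote)
print(lazy)  # ['"yes"', '"no"']
```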

In [None]:
# Let's try the same code on a long string (of all lines), instead of running per line
file_string = ' '.join([x for x in result]) 
# DIY at home:
# Read about list comprehensions: https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40
# (until "List Comprehensions vs map and filter")
 
#print(file_string[:1000])

In [None]:
#Let's find the e-mail addresses
pattern_email = "<.+@.+>"
email_list = re.findall(pattern_email, file_string)

print(len(email_list))
print(email_list)

#Analyse the output. Is it correct?



## __Basic regex operations/patterns (continued)__



<li>\s (lowercase s) -- matches a single whitespace character: space, newline, carriage return, tab, form feed or vertical tab [ \n\r\t\f\v].
<li>\S (uppercase S) -- matches any non-whitespace character.

<li>\w (lowercase w) -- matches a "word" character: a letter, a digit or an underscore [a-zA-Z0-9_]. In Python 3, non-ASCII letters and digits also count as word characters unless the re.ASCII flag is set. Note that although "word" is the mnemonic for this, it only matches a single word character, not a whole word.
<li>\W (uppercase W) -- matches any non-word character.
<li>\t, \n, \r -- tab, newline, carriage return


<br>
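
A short sketch of these character classes in action. The sample strings and variable names are invented for the illustration.

```python
import re

# Hypothetical log-like line containing spaces, a tab and a newline
text = "user_1 logged in\tat 10:45\n"

words = re.findall(r"\w+", text)    # runs of word characters
spaces = re.findall(r"\s", text)    # every single whitespace character
chunks = re.findall(r"\S+", text)   # runs of non-whitespace characters
non_word = re.findall(r"\W", "a-b_c!")  # the underscore counts as a word character

print(words)     # ['user_1', 'logged', 'in', 'at', '10', '45']
print(spaces)    # [' ', ' ', '\t', ' ', '\n']
print(chunks)    # ['user_1', 'logged', 'in', 'at', '10:45']
print(non_word)  # ['-', '!']
```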

_5._ In the "result" list, collect the first and the last word of each line.



In [None]:
# The start and the end of a string
pattern_first_word = ??? #Fill the code
pattern_last_word = ??? #Fill the code

for line in result[:10]:
    print(line)
    first_word = re.findall(pattern_first_word, line)
    last_word = re.findall(pattern_last_word, line)
    print(first_word)
    print(last_word) 


## __Basic regex operations/patterns (continued)__
<li>\d -- decimal digit (0-9)
<li>{m} -- specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.
<li>{m,n} -- causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. 
<li>{m,n}? -- causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous quantifier. 
    
<br>    
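
The quantifiers above can be combined with \d. The following sketch uses a made-up string of numbers (string and variable names are assumptions for the illustration) to contrast the exact, greedy and non-greedy forms.

```python
import re

# Hypothetical string of numbers with 1 to 5 digits
numbers = "7 42 123 2024 98765"

exact = re.findall(r"\d{4}", numbers)     # exactly four digits at a time
greedy = re.findall(r"\d{2,4}", numbers)  # two to four digits, as many as possible
lazy = re.findall(r"\d{2,4}?", numbers)   # two to four digits, as few as possible

print(exact)   # ['2024', '9876']
print(greedy)  # ['42', '123', '2024', '9876']
print(lazy)    # ['42', '12', '20', '24', '98', '76']
```

Note that \d{4} also picks four digits out of the five-digit number, which matters when a pattern is supposed to match years only.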
For example, on the 6-character string __'aaaaaa'__, what will the following patterns match?

> a{3,5} 

> a{3,5}? 

> a{3,} 

> a{,5} 

> a{3,}? 

> a{1,5}? 




In [None]:
import re
string = 'aaaaaa'

pattern_list = ["a{3,5}", "a{3,5}?", "a{3,}", "a{,5}", "a{3,}?", "a{1,5}?"] 
for pattern in pattern_list:
    chars = re.findall(pattern, string)
    print(chars)
    

<li>[ ] -- indicates a set of characters. In a set:

> Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

> Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. 

> Special characters lose their special meaning inside sets. For example, \[(+\*)\] will match any of the literal characters '(', '+', '*', or ')'.


> __Negation with '^':__ if the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
    
<li> | (or operator) -- indicates alternation. For example a|b will match either "a" or "b" characters. <br>

For more information: https://docs.python.org/3/library/re.html
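
Character sets and alternation can be sketched as follows. All sample strings and variable names are made up for this illustration.

```python
import re

vowels = re.findall(r"[aeiou]", "regex")                # listed characters
hex_runs = re.findall(r"[0-9A-Fa-f]+", "ff 1A zz 09")   # ranges inside a set
not_digits = re.findall(r"[^0-9 ]+", "abc 123 de4")     # negated set: no digits, no spaces
symbols = re.findall(r"[.+*]", "1+1=2. 2*2=4.")         # meta-characters are literal in a set
pets = re.findall(r"cat|dog", "cat, dog, bird")         # alternation

print(vowels)      # ['e', 'e']
print(hex_runs)    # ['ff', '1A', '09']
print(not_digits)  # ['abc', 'de']
print(symbols)     # ['+', '.', '*', '.']
print(pets)        # ['cat', 'dog']
```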

Now, back to analysing the e-mails!

_6._ Collect all (and only) the years appearing in the lines that start with:<br>
> "From r"<br> 
example: From r  Wed Oct 30 21:41:56 2002<br>
> "Date:" <br>
example: Date: Thu, 31 Oct 2002 02:38:20 +0000<br>

Our code should return "2002" in both examples (and nothing else :)

In [None]:
import re

with open("./fradulent_emails_utf8.txt", "r", encoding="utf-8") as f:
    file_lines = f.readlines()

pattern_year = r"\d{4}"
# Is this a good pattern? Look at the lines starting with "From r" & "Date:"
# What are other alternatives (with the additional operations we learned)?

# When the regex pattern returns an output, print the line and print the year
for i in range(20):
    line = file_lines[i]
    ??? # Fill the code




### Flags
Flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two 
names, a long name such as IGNORECASE and a short, one-letter form such as I.

_re.search(pattern,string)_ is the same as _re.search(pattern,string,flags=0)_


|Short name|Long name|Meaning|
|---|---|---|
|re.I|re.IGNORECASE|ignore case|
|re.S|re.DOTALL|makes the dot also match a newline ("\n")|
|re.M|re.MULTILINE|makes ^ and $ match at the start and end of every line|


Example: 

*re.search(pattern,string,flags=re.IGNORECASE|re.S)*
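
Several flags can be combined with the | operator. The sketch below uses a made-up three-line string (string and variable names are assumptions for the illustration) to show the effect of each flag from the table.

```python
import re

# Hypothetical three-line string
text = "Start\nmiddle\nSTART again"

no_flags = re.findall(r"^start", text)                    # case-sensitive, start of string only
ignore_case = re.findall(r"^start", text, re.IGNORECASE)
both = re.findall(r"^start", text, re.IGNORECASE | re.MULTILINE)

print(no_flags)     # []
print(ignore_case)  # ['Start']
print(both)         # ['Start', 'START']

# Without re.DOTALL the dot stops at a newline, so this cannot span three lines
one_line = re.findall(r"Start.*again", text)
across_lines = re.findall(r"Start.*again", text, re.DOTALL)

print(one_line)      # []
print(across_lines)  # ['Start\nmiddle\nSTART again']
```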

In [None]:
#Let's find all the words "test" with all possible casings
s = 'This is one Test, another TEST, and another test.'

result = re.findall('test', s)
#result = ??? #(ignore casing)
print(result)



## DIY: Basic regex operations

### DIY 1 ##
Write a function that returns "Match" if a given input contains an 'a' followed by at least two 'b's and "No match" otherwise.

In [None]:
import re
def text_match(text):
    pattern = ??? #Add the correct pattern
    if re.findall(pattern, text):
        return "Match"
    else:
        return "No match"
    
            
print(text_match("ab")) # No match
print(text_match("aabbbbbc")) # Match
print(text_match("aabcbb")) # No Match

### DIY 2 ##
Write a function that returns "Match" if a given input contains 'z', not at the start or end of the string.

In [None]:
import re
def text_match(text):
    pattern = ??? # Add the correct pattern
    if re.findall(pattern, text):
        print(re.findall(pattern, text))
        return("Match")
    else:
        return("No match")



In [None]:
#Validate your code with the following tests (and the correct answers)
print(text_match("the lazy dog.")) # Match
print(text_match("python exercises.")) # No match
print(text_match("zoos are open again.")) # No match
print(text_match("lazy lazy dog.")) # Match

### DIY 3 ##
Write a function that returns "Match" if a given input contains sequences of lowercase letters joined with an underscore.

In [None]:
import re
def text_match(text):
    pattern = ??? #Add the correct pattern
    if re.findall(pattern, text):
        print(re.findall(pattern, text))
        return("Match")
    else:
        return("No match")
            


In [None]:
#Validate your code with the following tests (and the correct answers)
print(text_match("aab_cbbbc")) # Match
print(text_match(" aab_cbbbc")) # Match
print(text_match("aab_Abbbc")) # No match
print(text_match("Aaab_abbbc")) # No match
print(text_match(" 11ab_abb22")) # No match

### DIY 4 ##
Write a function that returns "Match" if a given input starts with any of the following characters: "a", "b" or "c". It returns "No match" otherwise.


In [None]:
import re
def text_match(text):
    pattern = ??? #Add the correct pattern
    if re.findall(pattern,  text):
        return 'Match'
    else:
        return('No match')
            

In [None]:
print(text_match("aaab")) # Match
print(text_match("ddda")) # No Match 
print(text_match("Aef")) # No match
print(text_match("bddda")) # Match

### DIY 5
Create a function __is_integer__ that accepts a string and returns __True__ if the string is an integer, and __False__ otherwise.

A string is an integer if it:

<li>consists of 1 or more digits
<li>optionally begins with - (minus sign)
<li>does not contain any other non-digit characters.



In [None]:
import re
def is_integer(text):
    # Add your code here
    pass  # placeholder so the cell runs; replace it with your code

    

In [None]:
#Validate your code
print(is_integer(""))      # False
print(is_integer(" 7"))    # False
print(is_integer("3222"))  # True
print(is_integer("-875"))  # True
print(is_integer("+223"))  # False
print(is_integer("00"))    # True
print(is_integer("1.0"))   # False
print(is_integer("7A"))    # False

## Groups

So far we have been extracting everything that matches with our patterns. 
"Groups" in regex allow us to pick out parts of the matching patterns. 



Suppose we would like to capture different parts of a given string and assign them to different variables.
We can do this by using "groups", which are indicated by parentheses within the regex pattern. The parentheses do not change what the pattern will match; instead, they establish logical "groups" inside the matched text.

While using "findall", in the case of multiple groups, the result will be a list of tuples, each tuple containing the different groups captured.
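
The effect of groups on the output of findall can be sketched with a made-up log line. The string, the date format and the variable names are assumptions for this illustration.

```python
import re

# Hypothetical log line with two dates
log = "2002-10-30 error; 2003-01-15 warning"

# No group: findall returns the whole match
no_group = re.findall(r"\d{4}-\d{2}-\d{2}", log)
print(no_group)  # ['2002-10-30', '2003-01-15']

# One group: findall returns only the text captured by the group
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", log)
print(one_group)  # ['2002', '2003']

# Several groups: findall returns one tuple per match
three_groups = re.findall(r"(\d{4})-(\d{2})-(\d{2})", log)
print(three_groups)  # [('2002', '10', '30'), ('2003', '01', '15')]

# Tuples unpack directly into separate variables
for year, month, day in three_groups:
    print(year, month, day)
```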

In [None]:
line = "Cats are smarter than dogs"
pattern0 = ".* are .*? .*"
pattern1 = "(.*) are .*? .*"
pattern2 = "(.*) are (.*?) .*"
pattern3 = "((.*) are (.*?) .*)"

#Q1: What should be the result if we use these patterns to match the given line?
#Q2: When multiple groups are captured (as in pattern3), in what order are they stored?

patterns = [pattern0, pattern1, pattern2, pattern3]
for p in patterns:
    result = re.findall(p, line)
    print(line)
    print(result)


### DIY: Groups 

Capture only the username and the host from each e-mail address (at the same time) in the data set of spam e-mails.
For the e-mail address username@gmail.com, your pattern should return ('username', 'gmail').

Example output:<br>
From: "MR. JAMES NGOLA." \<james_ngola2002@maktoob.com\>

[('james_ngola2002', 'maktoob')]

From: "Mr. Ben Suleman" \<bensul2004nng@spinfinder.com\>

[('bensul2004nng', 'spinfinder')]

In [None]:
import re

with open("./fradulent_emails_utf8.txt", "r", encoding="utf-8") as f:
    file_lines = f.readlines()


#Let's find the e-mail addresses
pattern_email = ??? #Add the correct pattern

for line in file_lines[:1000]:
    if line.startswith("From:"):
        email = re.findall(pattern_email, line)
        print(line.strip())
        print(email)
        print()
        

## "Looking around"

Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions, just like the start-of-string and end-of-string anchors (^ and \$) explained earlier. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called “assertions”. **They do not consume characters in the string, but only assert whether a match is possible or not.**

Lookaround allows you to create regular expressions that are impossible to create without them, or that would get very long-winded without them.

|Lookaround|Name|What it does|
|---|---|---|
|(?=foo)|Lookahead|asserts that what immediately follows is "foo"|
|(?<=foo)|Lookbehind|asserts that what immediately precedes is "foo"|
|(?!foo)|Negative lookahead|asserts that what immediately follows is _not_ "foo"|
|(?<!foo)|Negative lookbehind|asserts that what immediately precedes is _not_ "foo"|
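
To see what "not consuming characters" means in practice, compare a pattern that includes the surrounding text with one that only looks at it. This is a sketch on made-up strings; the variable names are assumptions for the illustration.

```python
import re

# Hypothetical amounts with currency codes
amounts = "100USD 200EUR 300USD"

# USD is part of the pattern, so it is part of the result
consumed = re.findall(r"\d+USD", amounts)
print(consumed)  # ['100USD', '300USD']

# USD is only checked by the lookahead, so it stays out of the result
ahead = re.findall(r"\d+(?=USD)", amounts)
print(ahead)  # ['100', '300']

# Lookbehind: digits that come right after "id="
fields = "id=42 ref=7 id=99"
behind = re.findall(r"(?<=id=)\d+", fields)
print(behind)  # ['42', '99']
```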



In [None]:
text = "123foo456"
pattern1 = ".(?=foo)"
pattern2 = ".+(?=foo)"
pattern3 = "(?<=foo).+"
pattern4 = ".+(?!f)"
pattern5 = ".+?(?!f)"
pattern6 = "(?<!f).+?"

# What should be the results if we use these patterns to match the given "text"?

# Run each pattern on the given "text" with findall and print the result

patterns = [pattern1, pattern2, pattern3, pattern4, pattern5, pattern6]
for p in patterns:
    result = re.findall(p, text)
    #print(result)



### __What if we need to capture the meta-characters in text? (Revisited)__

Let's try to capture all the years preceded with a plus sign in the following string. Do not capture the plus sign this time! Our code should return ['1789', '2009'].

In [None]:
string = "August 23; +1789; April 14, +2009; September 12 1960; March 3, 1346, October 2, 1982"
pattern = ??? # Add the correct pattern. Tip: use a (positive) lookbehind.

result = re.findall(pattern, string)
print(result)

### Illustrate lookaround "not capturing" matches

Write a pattern to catch any character followed by "ba". The code should return ['c', 'a'].



In [None]:
text= "cbaba"

pattern = "(.)ba" # Is this a good pattern?
print(re.findall(pattern, text))

## Other Regex Functions
So far we have been using __re.findall__ to retrieve all possible matches.
But this is not always what we want to do.
There are other regex functions we can use:

<br>

#### **re.search(pattern, string, flags=0)**
Scan through string looking for the __first location where the regular expression pattern produces a match__, and return a corresponding match object. Return None if no position in the string matches the pattern.

<br>

#### **re.match(pattern, string, flags=0)** 
If zero or more characters __at the beginning of string__ match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern.
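
Both functions return a match object (or None), not a list. A minimal sketch of how such an object is typically used, on a made-up string (the string and variable names are assumptions for the illustration):

```python
import re

# Hypothetical sample string
sample = "2002 spam 2007"

# search scans the whole string and stops at the first match
found = re.search(r"[a-z]+", sample)
if found:  # always check for None before using the match object
    print(found.group())  # spam
    print(found.span())   # (5, 9)

# match only tries at position 0, where there are digits, not letters
letters_at_start = re.match(r"[a-z]+", sample)
print(letters_at_start)  # None

digits_at_start = re.match(r"\d+", sample)
print(digits_at_start.group())  # 2002
```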




In [None]:
a = "123abc"
print(re.match("[a-z]+",a))


If you want to locate a match anywhere in string, use __search()__ instead.

In [None]:
a = "123abc"
print(re.search("[a-z]+",a))
#print(re.search("[a-z]+",a).group(0))

#### __re.split(pattern, string, maxsplit=0, flags=0)__
Split string by the occurrences of pattern resulting in a list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
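
A small sketch of the maxsplit argument, using a made-up comma-separated string (the string and variable names are assumptions for the illustration):

```python
import re

# Hypothetical list with irregular spacing after the commas
items = "a, b,c,   d"

# Split on a comma followed by any amount of whitespace
all_parts = re.split(r",\s*", items)
print(all_parts)  # ['a', 'b', 'c', 'd']

# At most two splits: the rest of the string stays in the last element
limited = re.split(r",\s*", items, maxsplit=2)
print(limited)  # ['a', 'b', 'c,   d']
```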


In [None]:
# Examples with "split"
# What will the following split operations return?
string = 'Words,   words, words.'
result1 = re.split('words,', string) 
result2 = re.split(r'\s', string) 
result3 = re.split(r'\s+', string)
result4 = re.split(',', string) 
result5 = re.split('(words,)', string)

# What should be the results if we use these patterns to match the given "string"?
### Print the results.


#### __re.sub(pattern, repl, string, count=0, flags=0)__
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement "repl". If the pattern isn’t found, string is returned unchanged.
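
The replacement can be a plain string, a string with backreferences such as \1, or a function that receives the match object. A minimal sketch on made-up strings (all names are assumptions for the illustration):

```python
import re

# Hypothetical sample text
text = "room 12, floor 3"

all_replaced = re.sub(r"\d+", "#", text)
print(all_replaced)  # room #, floor #

# count limits the number of replacements
first_only = re.sub(r"\d+", "#", text, count=1)
print(first_only)  # room #, floor 3

# \1 and \2 refer to the first and second group of the pattern
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@host")
print(swapped)  # host at user

# A function can compute each replacement from the match object
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)
print(doubled)  # room 24, floor 6
```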



In [None]:
# Example with "sub"
# What will the following substitution operations return?
string = 'Words, words, words.'
result0 = re.sub(',', '', string)           
result1 = re.sub(',', '*', string)          
result2 = re.sub(r'\s+', '_', string)           
result3 = re.sub('w(ord)s', r'\1', string)    
result4 = re.sub('w(ord)s', r'\1', string, re.I)    
result5 = re.sub('w(ord)s', r'\1', string, flags=re.IGNORECASE)    

# What should be the results if we use these patterns to match the given "string"?
### Print the results.


#### __re.compile(pattern, flags=0)__
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods. When should we use this?
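
One common reason, sketched below, is reusing the same pattern many times, for example inside a loop. The sample lines and variable names are made up for the illustration. The `re` module also caches recently used patterns on its own, so for small scripts the benefit is often readability rather than speed.

```python
import re

# Compile once, give the pattern a descriptive name, reuse it everywhere
code_re = re.compile(r"\d{4}")

# Hypothetical sample lines
lines = ["order 1234 shipped", "no number here", "order 5678 pending"]

codes = []
for line in lines:
    m = code_re.search(line)  # method on the compiled pattern object
    codes.append(m.group() if m else None)
print(codes)  # ['1234', None, '5678']

# Flags are given once, at compile time
hello_re = re.compile("hello", re.IGNORECASE)
greetings = hello_re.findall("Hello hello HELLO")
print(greetings)  # ['Hello', 'hello', 'HELLO']
```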

In [None]:
pattern = re.compile('hello')
result = re.findall(pattern, 'hello world')
print(result)

result = re.findall('hello', 'hello world')
print(result)

This course contains material from:
- dataquest.io
- https://www.regular-expressions.info/lookaround.html
- https://docs.python.org/3/library/re.html
- https://www.bogotobogo.com/python/python_regularExpressions.php