# Chapter 1: Regular Expressions

In [None]:
#We will use the "re" package for regular expressions
import re


## **DIY: Find years**
Write a function that returns a list of the years in a given input string.
Your function should return ['1789', '2009', '1960', '1346', '1982'] in the following example:

In [2]:
text = "August 23; 1789; April 14, 2009; September 12 1960; March 3, 1346, October 2, 1982"

def find_years(text):
    years_list = []
    number = ""
    for s in text:
        if s in ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]:
            number+=s
        else:
            if len(number) == 4:
                years_list.append(number)
            else:
                pass
            number = ""
    #Without this it will not capture 1982
    if len(number) == 4:
        years_list.append(number)
    return years_list

years_list = find_years(text)
print(years_list)

['1789', '2009', '1960', '1346', '1982']


## **Regular expressions**

A regular expression (regex) defines a search pattern for strings. The search pattern can be anything from a simple character, a fixed string or a complex expression containing special characters describing the pattern. The pattern defined by the regex may match one or several times or not at all for a given string.

Regular expressions can be used to search, edit and manipulate text.

The power of regex comes from being able to match complex strings with simple patterns.

## re.findall()
re.findall() returns all the non-overlapping matches of a pattern in a string, as a list of strings.

Let's try to capture all (and only) the years in the following string:

In [None]:
string = "August 23; 1789; April 14, 2009; September 12 1960; March 3, 1346, October 2, 1982"
pattern = r"\d{4}"  # raw string, so Python passes the backslash to the regex engine unchanged

result = re.findall(pattern, string)
print(result)



['1789', '2009', '1960', '1346', '1982']


## __Basic regex operations/patterns__

(Most) characters match themselves. However, to be able to build patterns, we need to reserve some symbols, the **meta-characters**. These do not match themselves because they have special meanings: . * + ? ^ \$ \{ \} [ ] | ( ) 


<br>
<li>. (dot) -- matches any single character except newline '\n'
<li>* (star) -- causes the preceding regex pattern to match 0 or more repetitions. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
<li>+ (plus) -- causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
<li>? (q.mark) -- causes the preceding RE to match 0 or 1 repetitions: ab? will match either ‘a’ or ‘ab’. Placed after another quantifier (*?, +?, ??), it makes that quantifier non-greedy: by default, regex patterns are greedy (they match the longest possible string), and "?" lets us match the shortest possible string instead.
<li>^ (caret) -- matches the start of the string (it matches a position and consumes no characters)
<li>\$ (dollar) -- matches the end of the string, or just before a newline at the end of the string (it also matches a position and consumes no characters)
<br>
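To see these meta-characters in action, here is a minimal sketch. The sample strings and variable names are made up for illustration only.

```python
import re

# . (dot): any single character between 'a' and 'c'
dot_matches = re.findall(r"a.c", "abc a-c ac")
print(dot_matches)       # ['abc', 'a-c']

# * (star): 'a' followed by zero or more 'b's
star_matches = re.findall(r"ab*", "a ab abbb")
print(star_matches)      # ['a', 'ab', 'abbb']

# + (plus): 'a' followed by one or more 'b's, so a lone 'a' is skipped
plus_matches = re.findall(r"ab+", "a ab abbb")
print(plus_matches)      # ['ab', 'abbb']

# ? (question mark): the 'u' is optional
optional_matches = re.findall(r"colou?r", "color colour")
print(optional_matches)  # ['color', 'colour']

# ^ and $: only the 'ab' at the very start / very end of the string
start_matches = re.findall(r"^ab", "abab")
end_matches = re.findall(r"ab$", "abab")
print(start_matches, end_matches)  # ['ab'] ['ab']
```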

Python regex cheatsheet: https://www.dataquest.io/blog/regex-cheatsheet/
Test regular expressions: https://regex101.com/

### **DIY1: A's and B's** 
Write a Python program that matches a string that starts with an **a** followed by **(only) zero or more** **b**'s .



In [16]:
import re
def text_match(text):
        pattern = '^ab*$'
        if re.findall(pattern,  text):
                return 'Found a match!'
        else:
                return('Not matched!')
print(text_match("ac")) #Not matched!
print(text_match("abc")) #Not matched!
print(text_match("a")) # Found a match!
print(text_match("ab")) # Found a match!
print(text_match("abb")) # Found a match!
print(text_match("cabb")) #Not matched!

Not matched!
Not matched!
Found a match!
Found a match!
Found a match!
Not matched!


### __What if we need to capture the meta-characters in text?__

Let's try to capture all the years (including the plus sign, where a year is preceded by one) in the following string:

In [None]:
string = "August 23; +1789; April 14, +2009; September 12 1960; March 3, 1346, October 2, 1982"
pattern = r"\+?\d{4}"  # \+ is a literal plus sign; the ? after it makes it optional

result = re.findall(pattern, string)
print(result)

['+1789', '+2009', '1960', '1346', '1982']
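The same idea works for any meta-character: a backslash in front of it makes it literal. The sketch below uses a made-up string. It also shows re.escape(), a helper from the standard re module that adds the backslashes for us.

```python
import re

text = "Total: $4.50 (was $5.00), 1+1=2"

# \$ is a literal dollar sign and \. is a literal dot
prices = re.findall(r"\$\d+\.\d{2}", text)
print(prices)     # ['$4.50', '$5.00']

# re.escape() escapes every meta-character in a literal string
literal = "1+1=2"
found = re.findall(re.escape(literal), text)
print(found)      # ['1+1=2']

# Unescaped, "+" is a quantifier: "1+1=2" means one or more 1s, then "1=2"
unescaped = re.findall(literal, text)
print(unescaped)  # []
```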


### __The Task:__ Analyze Spam Emails

In this tutorial, we’ll use the Fraudulent Email Corpus from Kaggle. It contains thousands of phishing emails sent between 1998 and 2007. 

1. Open the file "fradulent_emails_utf8.txt"
2. Collect the list of all senders
> Write a regex pattern to find all lines starting with "From:" 
3. Collect only the e-mail addresses of the senders
4. Collect only the names of the senders 

In [4]:
import re
import codecs
import sys

file_content = codecs.open("./fradulent_emails_utf8.txt", "r", "utf-8").read()
print(file_content[:300])

From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@acl


In [5]:
# Collect lines starting with "From:"
pattern = "^From:.*$"
result = re.findall(pattern, file_content)
print(result)


[]


By default, the caret only matches at the start of the string. So if you have a multi-line string (for example, the contents of a text file), it will still match at most once: at the very beginning. Likewise, the dollar sign only matches at the very end.

However, you may want to match at the beginning of each line. For example, you may want to find all lines that start with ‘From:’ in a given string.

The re.MULTILINE flag makes the caret match at the beginning of each line (and the dollar sign at the end of each line). The cell above ran the pattern without the flag; the next one sets it:

In [8]:
# Collect lines starting with "From:"
pattern = "^From:.*$"
result = re.findall(pattern, file_content, re.MULTILINE)
print(result)

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>', 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>', 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>', 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>', 'From: "Maryam Abacha" <m_abacha03@www.com>', 'From: Kuta David <davidkuta@postmark.net>', 'From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com>', 'From: "William Drallo" <william2244drallo@maktoob.com>', 'From: "MR USMAN ABDUL" <abdul_817@rediffmail.com>', 'From: "Tunde  Dosumu" <barrister_td@lycos.com>', 'From: MR TEMI JOHNSON <temijohnson2@rediffmail.com>', 'From: "Dr.Sam jordan" <sjordan@diplomats.com>', 'From: p_brown2@lawyer.com', 'From: mic_k1@post.com', 'From: "COL. MICHAEL BUNDU" <mikebunduu1@rediffmail.com>', 'From: "MRS MARIAM ABACHA" <elixwilliam@usa.com>', 'From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>', 'From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>', 'From: "Victor Aloma" <victorloma@netscape.net>', 'From: "Victor Aloma" <victorloma@netscape.n
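The effect of the flag can also be checked on a short string. This is a minimal sketch; the three-line sample text and its addresses are made up.

```python
import re

sample = "From: alice@example.com\nTo: bob@example.com\nFrom: carol@example.com"

# Without the flag, ^ matches only at position 0 and $ only at the very end,
# so a pattern that must fit on one line finds nothing in a multi-line string.
without_flag = re.findall(r"^From:.*$", sample)
print(without_flag)  # []

# With re.MULTILINE, ^ and $ match at the start and end of every line.
with_flag = re.findall(r"^From:.*$", sample, flags=re.MULTILINE)
print(with_flag)     # ['From: alice@example.com', 'From: carol@example.com']
```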

In [9]:
#Let's find the e-mail addresses
pattern_email = r"<.+@.+\..{3}>"
#Simpler option: pattern_email = "<.+@.+>"
#Loop over all lines  in result, try to capture an e-mail address and print it

for line in result[:10]:
    email = re.findall(pattern_email, line)
    print(email)

['<james_ngola2002@maktoob.com>', '<bensul2004nng@spinfinder.com>', '<obong_715@epatra.com>', '<obong_715@epatra.com>', '<m_abacha03@www.com>', '<davidkuta@postmark.net>', '<tunde_dosumu@lycos.com>', '<william2244drallo@maktoob.com>', '<abdul_817@rediffmail.com>', '<barrister_td@lycos.com>']


In [11]:
# How can we extract only the names?
### Write a regex pattern
### Loop over all lines in result, and use re.findall to find matching patterns and print them
pattern_name = '".+"'

for line in result[:10]:
    name = re.findall(pattern_name, line)
    print(name)




['"MR. JAMES NGOLA."', '"Mr. Ben Suleman"', '"PRINCE OBONG ELEME"', '"PRINCE OBONG ELEME"', '"Maryam Abacha"', '"Barrister tunde dosumu"', '"William Drallo"', '"MR USMAN ABDUL"', '"Tunde  Dosumu"', '"Dr.Sam jordan"']


### Greedy vs. non-greedy search

In [17]:
#Let's try the same code on a long string (of all lines), instead of running per line
file_string = ' '.join(result)
print(file_string[:300])


From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com> From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com> From: "PRINCE OBONG ELEME" <obong_715@epatra.com> From: "PRINCE OBONG ELEME" <obong_715@epatra.com> From: "Maryam Abacha" <m_abacha03@www.com> From: Kuta David <davidkuta@postmark.net> From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com> From: "William Drallo" <william2244drallo@maktoob.com> From: "MR USMAN ABDUL" <abdul_817@rediffmail.com> From: "Tunde  Dosumu" <barrister_td@lycos.com> From: MR TEMI JOHNSON <temijohnson2@rediffmail.com> From: "Dr.Sam jordan" <sjordan@diplomats.com> From: p_brown2@lawyer.com From: mic_k1@post.com From: "COL. MICHAEL BUNDU" <mikebunduu1@rediffmail.com> From: "MRS MARIAM ABACHA" <elixwilliam@usa.com> From: " DR. ANAYO AWKA " <anayoawka@hotmail.com> From: " DR. ANAYO AWKA " <anayoawka@hotmail.com> From: "Victor Aloma" <victorloma@netscape.net> From: "Victor Aloma" <victorloma@netscape.net> From: "JAMES NGOLA" <james_ngola2002@maktoob.com> From:

In [18]:
#If we don't use "?", the pattern will match the longest possible string (i.e. greedy matching)
#? mark makes the search non-greedy, in other words, the pattern will match the shortest possible string

#Let's find the e-mail addresses
pattern_email = "<.+>" 
email_list = re.findall(pattern_email, file_string)

print(len(email_list)) # The length of the e-mail list is 1: because it captures the whole string (until the last >)
print(email_list)


1
['<james_ngola2002@maktoob.com> From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com> From: "PRINCE OBONG ELEME" <obong_715@epatra.com> From: "PRINCE OBONG ELEME" <obong_715@epatra.com> From: "Maryam Abacha" <m_abacha03@www.com> From: Kuta David <davidkuta@postmark.net> From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com> From: "William Drallo" <william2244drallo@maktoob.com> From: "MR USMAN ABDUL" <abdul_817@rediffmail.com> From: "Tunde  Dosumu" <barrister_td@lycos.com> From: MR TEMI JOHNSON <temijohnson2@rediffmail.com> From: "Dr.Sam jordan" <sjordan@diplomats.com> From: p_brown2@lawyer.com From: mic_k1@post.com From: "COL. MICHAEL BUNDU" <mikebunduu1@rediffmail.com> From: "MRS MARIAM ABACHA" <elixwilliam@usa.com> From: " DR. ANAYO AWKA " <anayoawka@hotmail.com> From: " DR. ANAYO AWKA " <anayoawka@hotmail.com> From: "Victor Aloma" <victorloma@netscape.net> From: "Victor Aloma" <victorloma@netscape.net> From: "JAMES NGOLA" <james_ngola2002@maktoob.com> From: "MARTIN CHIME" <mart

In [19]:
# Correct answer:
pattern_email = "<.+?>" 
email_list = re.findall(pattern_email, file_string)
print(email_list[:5])

['<james_ngola2002@maktoob.com>', '<bensul2004nng@spinfinder.com>', '<obong_715@epatra.com>', '<obong_715@epatra.com>', '<m_abacha03@www.com>']


## __Basic regex operations/patterns (continued)__

<li>\s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. 
<li>\S (upper case S) matches any non-whitespace character.

<li>\w (lowercase w): matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. 
<li>\W (upper case W) matches any non-word character.
<li>\t, \n, \r -- tab, newline, return
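A quick way to get a feeling for these classes is to run them on one small string. This is only a sketch with a made-up sentence.

```python
import re

text = "user_1 sent\t2 e-mails!\n"

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)
print(words)      # ['user_1', 'sent', '2', 'e', 'mails']

# \W+ : runs of everything else (whitespace and punctuation)
non_words = re.findall(r"\W+", text)
print(non_words)  # [' ', '\t', ' ', '-', '!\n']

# \S+ : runs of non-whitespace, so "e-mails!" stays in one piece
tokens = re.findall(r"\S+", text)
print(tokens)     # ['user_1', 'sent', '2', 'e-mails!']

# \s : single whitespace characters
spaces = re.findall(r"\s", text)
print(spaces)     # [' ', '\t', ' ', '\n']
```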


<br>

_5._ In the "result" list, collect the first and the last word of each line.



In [20]:
pattern_first_word = r"^\w+"
pattern_last_word = r"\w+$" # Why doesn't this show any matches?
# Because these lines end with ">", and angle brackets are not word characters (see the definition of \w).
# What we are looking for is not necessarily a "word", so \S is the better choice:
#pattern_first_word = r"^\S+"
#pattern_last_word = r"\S+$"

for line in result[:10]:
    print(line)
    first_word = re.findall(pattern_first_word, line)
    last_word = re.findall(pattern_last_word, line)
    print(first_word)
    print(last_word)


From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
['From']
[]
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
['From']
[]
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
['From']
[]
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
['From']
[]
From: "Maryam Abacha" <m_abacha03@www.com>
['From']
[]
From: Kuta David <davidkuta@postmark.net>
['From']
[]
From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com>
['From']
[]
From: "William Drallo" <william2244drallo@maktoob.com>
['From']
[]
From: "MR USMAN ABDUL" <abdul_817@rediffmail.com>
['From']
[]
From: "Tunde  Dosumu" <barrister_td@lycos.com>
['From']
[]


## __Basic regex operations/patterns (continued)__
<li>\d -- decimal digit (0-9) 
<li>{m} -- specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.
<li>{m,n} -- causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. 
<li>{m,n}? -- causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous quantifier. For example, on the 6-character string 'aaaaaa', what will the following patterns match?

> a{3,5} (matches 5 'a' characters, as many as allowed; the single remaining 'a' is too short to match)

> a{3,5}? (matches only 3 characters, twice)

> a{3,} (matches at least 3 characters and as many as possible: all 6)

> a{,2} (matches at most 2 characters, and 2 whenever possible: 'aa' three times, then an empty match at the end of the string)

> a{3,}? (matches the minimum of 3 characters, twice)

> a{1,5}? (matches 1 character, six times)



In [None]:
import re
string = 'aaaaaa'

pattern_list = ["a{3,5}", "a{3,5}?", "a{3,}", "a{,2}", "a{3,}?", "a{1,5}?"]
for pattern in pattern_list:
    chars = re.findall(pattern, string)
    print(chars)
    

['aaaaa']
['aaa', 'aaa']
['aaaaaa']
['aa', 'aa', 'aa', '']
['aaa', 'aaa']
['a', 'a', 'a', 'a', 'a', 'a']


<li>[ ] -- indicates a set of characters. In a set:

> Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

> Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. 

> Special characters lose their special meaning inside sets. For example, \[\(+*\)\] will match any of the literal characters '(', '+', '*', or ')'.


> __Negation with '^':__ if the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
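The set rules above can be tried out as follows. This is a minimal sketch; all sample strings are made up.

```python
import re

# Characters listed individually
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)      # ['e', 'e']

# Ranges: two digits, a colon, then 00 to 59
times = re.findall(r"[0-2][0-9]:[0-5][0-9]", "at 23:59, 07:05 and 9:75")
print(times)       # ['23:59', '07:05']

# Several ranges in one set: runs of hexadecimal digits
hex_numbers = re.findall(r"[0-9A-Fa-f]+", "ff 1A zz")
print(hex_numbers) # ['ff', '1A']

# Meta-characters are literal inside a set
symbols = re.findall(r"[(+*)]", "(a+b)*c")
print(symbols)     # ['(', '+', ')', '*']

# Negation: runs of anything that is not a digit
not_digits = re.findall(r"[^0-9]+", "a1b22c")
print(not_digits)  # ['a', 'b', 'c']
```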

For more information: https://docs.python.org/3/library/re.html

Now, back to analysing the e-mails!

_6_. Collect all (and only) the years appearing in the lines that start with:<br>
> "From r"<br> 
example: From r  Wed Oct 30 21:41:56 2002<br>
> "Date:" <br>
example: Date: Thu, 31 Oct 2002 02:38:20 +0000<br>

Our code should return "2002" in both examples (and nothing else :)

In [2]:
import codecs
import re
import sys

file_lines = open("./fradulent_emails_utf8.txt", "r", encoding="utf-8").readlines()
#print(file_lines[:100])

#pattern_year = r"\d{4}" #This will fail for the lines with "Date:" (it also matches the "0000" in "+0000")
#pattern_year = r"[^+]\d{4}" #This will also capture the space before it so not the best option

#Correct pattern for these lines (a four-digit number starting with 1 or 2):
pattern_year = r"[1-2]\d{3}" 


for i in range(20):
    line = file_lines[i]
    if line.startswith("From r") or line.startswith("Date:"):
        year = re.findall(pattern_year, line)
        print(line.strip())
        print(year)
    else:
        pass




From r  Wed Oct 30 21:41:56 2002
['2002']
Date: Thu, 31 Oct 2002 02:38:20 +0000
['2002']


### Flags
Flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two 
names, a long name such as IGNORECASE and a short, one-letter form such as I.

_re.search(pattern,string)_ is the same as _re.search(pattern,string,flags=0)_


|short name|long name|meaning|
|---|---|---|
|re.I|re.IGNORECASE|ignore case|
|re.S|re.DOTALL|makes dot match newline ("\n") as well|
|re.M|re.MULTILINE|makes ^ and $ match at the start and end of every line|


Example: 

*re.search(pattern,string,flags=re.IGNORECASE|re.S)*

In [None]:
#Let's find all the occurrences of the word "test", whatever the casing
s = 'This is one Test, another TEST, and another test.'

#result = re.findall('test', s) # without the flag, only the lowercase 'test' is found
result = re.findall('test', s, flags=re.I) # ignore casing

print(result)

['Test', 'TEST', 'test']
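Several flags can be combined with the | operator, as in the re.search example above. The following sketch uses a made-up three-line string to show the three flags from the table side by side.

```python
import re

text = "first line\nSECOND line\nthird LINE"

# re.M: ^ matches at the start of every line
starts = re.findall(r"^\w+", text, flags=re.M)
print(starts)  # ['first', 'SECOND', 'third']

# re.I | re.M: ignore case AND let $ match at the end of every line
ends = re.findall(r"line$", text, flags=re.I | re.M)
print(ends)    # ['line', 'line', 'LINE']

# By default the dot stops at a newline ...
dot_default = re.findall(r"first.*", text)
print(dot_default)  # ['first line']

# ... with re.S it also matches "\n", so .* runs to the end of the string
dot_all = re.findall(r"first.*", text, flags=re.S)
print(dot_all)      # ['first line\nSECOND line\nthird LINE']
```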


## DIY: Basic regex operations

### DIY1 ##
Write a function that returns "Match" if a given input contains an 'a' followed by at least two 'b's and "No match" otherwise.

In [None]:
import re
def text_match(text):
        pattern = 'ab{2,}'
        if re.findall(pattern,  text):
                return 'Found a match!'
        else:
                return('Not matched!')
            
print(text_match("ab")) # No match
print(text_match("aabbbbbc")) # Match
print(text_match("aabcbb")) # No Match

Not matched!
Found a match!
Not matched!


### DIY2 ##
Write a function that returns "Match" if a given input contains 'z', not at the start or end of the word.


In [None]:
import re
def text_match(text):
        #patterns = '.z.' # Also correct
        #patterns = "^[^z]+z[^z]+$" (this will not be correct if there are multiple z's)
        patterns = "^[^z].*z.*[^z]$" # .* (rather than .+) so that short inputs such as "aza" also match
        print(re.findall(patterns,  text))
        if re.findall(patterns,  text):
                return 'Match'
        else:
                return('No match')

print(text_match("the lazy dog.")) # Match
print(text_match("python exercises.")) # No match
print(text_match("zoos are open again.")) # No match
print(text_match("lazy lazy dog.")) # Match (multiple z's)

['the lazy dog.']
Match
[]
No match
[]
No match
['lazy lazy dog.']
Match


### DIY 3 ##
Write a function that returns "Match" if a given input consists of two sequences of lowercase letters joined with an underscore.

In [None]:
import re
def text_match(text):
        pattern = '^[a-z]+_[a-z]+$' 
        
        if re.findall(pattern,  text):
                return 'Match'
        else:
                return('No match')
            
print(text_match("aab_cbbbc")) # Match
print(text_match("aab_Abbbc")) # No match
print(text_match("Aaab_abbbc")) # No match


Match
No match
No match


### DIY 4 ##
Write a function that returns "Match" if a given input starts with any of the following characters: "a", "b" or "c". It returns "No match" otherwise.

In [None]:
import re
def text_match(text):
    #pattern = '^a|b|c' # wrong: ^ applies only to 'a', so 'b' or 'c' would match anywhere in the string
    pattern = "^[abc]"
    print(re.findall(pattern, text))
    if re.findall(pattern,  text):
        return 'Match'
    else:
        return('No match')
            

In [None]:
print(text_match("aaab")) # Match
print(text_match("ddda")) # No match
print(text_match("Aef")) # No match
print(text_match("bddda")) # Match

['a']
Match
[]
No match
[]
No match
['b']
Match


### DIY5  ## 

Create a function __is_integer__ that accepts a string and returns __True__ if the string is an integer, and __False__ otherwise.

A string is an integer if it:

<li>consists of 1 or more digits
<li>optionally begins with -
<li>does not contain any other non-digit characters.



In [11]:
def is_integer(text):
    # Optional leading "-", then one or more digits, anchored at both ends
    pattern = r"^-?\d+$"
    result = re.findall(pattern, text)
    if result:
        return True
    else:
        return False
    

In [12]:
#Alternative with bool function
def is_integer(text):
    pattern = r"^-?\d+$"
    return bool(re.findall(pattern, text))



In [13]:
#Validate your code
print(is_integer("")) #False
print(is_integer(" 7")) #False
print(is_integer("3222")) #True
print(is_integer("-875")) #True
print(is_integer("+223")) #False
print(is_integer("00")) #True
print(is_integer("1.0")) #False
print(is_integer("7A")) #False

False
False
True
True
False
True
False
False


## Groups

So far we have been extracting everything that matches with our patterns. 
"Groups" in regex allow us to pick out parts of the matching patterns. 



Suppose we would like to capture different parts of a given string and assign them to different variables.
We can do this by using "groups", which are indicated by parentheses within the regex pattern. In this case, the parentheses do not change what the pattern will match; instead, they establish logical "groups" inside the matched text.

While using "findall", in the case of multiple groups, the result will be a list of tuples, each tuple containing the different groups captured.
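Because each match is a tuple, its parts can be unpacked straight into variables. Here is a small sketch; the addresses and the names user and host are made up for illustration.

```python
import re

text = "alice@example.com, bob@test.org"

# Two groups -> findall returns a list of 2-tuples
pairs = re.findall(r"(\w+)@(\w+)\.\w+", text)
print(pairs)  # [('alice', 'example'), ('bob', 'test')]

# Unpack each tuple into separate variables
for user, host in pairs:
    print(user, "is registered at", host)
```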

In [None]:
line = "Cats are smarter than dogs"
pattern1 = "(.*) are .*? .*"
pattern2 = "(.*) are (.*?) .*"
pattern3 = "((.*) are (.*?) .*)"

result1 = re.findall(pattern1, line)
result2 = re.findall(pattern2, line)
result3 = re.findall(pattern3, line)


print(result1)
print(result2)
print(result3)





['Cats']
[('Cats', 'smarter')]
[('Cats are smarter than dogs', 'Cats', 'smarter')]


### DIY. Groups

Capture only the username and the host from each e-mail address (at the same time) in the data set of spam e-mails.

For the e-mail address username@gmail.com, your pattern should return ('username', 'gmail').


In [16]:
import codecs
import re
import sys

file_lines = open("./fradulent_emails_utf8.txt", "r", encoding="utf-8").readlines()


#Let's capture the username and the host of each e-mail address
pattern_email = r"<(\S+)@(\S+)\.\S+>" 

for line in file_lines[:1000]:
    if line.startswith("From:"):
        email = re.findall(pattern_email, line)
        print(line)
        print(email)
        print()
        

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>

[('james_ngola2002', 'maktoob')]

From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>

[('bensul2004nng', 'spinfinder')]

From: "PRINCE OBONG ELEME" <obong_715@epatra.com>

[('obong_715', 'epatra')]

From: "PRINCE OBONG ELEME" <obong_715@epatra.com>

[('obong_715', 'epatra')]

From: "Maryam Abacha" <m_abacha03@www.com>

[('m_abacha03', 'www')]

From: Kuta David <davidkuta@postmark.net>

[('davidkuta', 'postmark')]

From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com>

[('tunde_dosumu', 'lycos')]

From: "William Drallo" <william2244drallo@maktoob.com>

[('william2244drallo', 'maktoob')]

From: "MR USMAN ABDUL" <abdul_817@rediffmail.com>

[('abdul_817', 'rediffmail')]

From: "Tunde  Dosumu" <barrister_td@lycos.com>

[('barrister_td', 'lycos')]

From: MR TEMI JOHNSON <temijohnson2@rediffmail.com>

[('temijohnson2', 'rediffmail')]

From: "Dr.Sam jordan" <sjordan@diplomats.com>

[('sjordan', 'diplomats')]

From: p_brown2@lawyer.c

## "Looking around"

Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions, just like the start-of-string (^) and end-of-string (\$) anchors introduced earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called “assertions”. **They do not consume characters in the string, but only assert whether a match is possible or not.** 

Lookaround allows you to create regular expressions that are impossible to create without them, or that would get very longwinded without them.

|Lookaround	|Name	|What it Does|
|---|---|---|
|.(?=foo)|Lookahead|any character followed by "foo"|
|(?<=foo).|Lookbehind|any character preceded by "foo"|
|.(?!foo)|Negative Lookahead|any character _not_ followed by "foo"|
|(?<!foo).|Negative Lookbehind|any character _not_ preceded by "foo"|




In [None]:
string = "123foo456"
pattern1 = ".(?=foo)"
pattern2 = ".+(?=foo)"
pattern3 = "(?<=foo).*"
pattern4 = ".*(?!f)" # will capture the whole string (Greedy)
pattern5 = ".+?(?!f)" # will not capture 3, will capture 3f (non-greedy)
pattern6 = "(?<!f).+?" # will not capture the first "o"

patterns = [pattern1, pattern2, pattern3, pattern4, pattern5, pattern6]
for p in patterns:
    result = re.findall(p, string)
    print(result)



['3']
['123']
['456']
['123foo456', '']
['1', '2', '3f', 'o', 'o', '4', '5', '6']
['1', '2', '3', 'f', 'o', '4', '5', '6']


### __What if we need to capture the meta-characters in text? (Revisited)__

Let's try to capture all the years preceded with a plus sign in the following string. Do not capture the plus sign this time! Our code should return ['1789', '2009'].

In [None]:
string = "August 23; +1789; April 14, +2009; September 12 1960; March 3, 1346, October 2, 1982"
pattern = r"(?<=\+)\d{4}"  # four digits that are preceded by a literal plus sign

result = re.findall(pattern, string)
print(result)
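The opposite question can be answered with a negative lookbehind. The sketch below is one possible way to do it and not the only one. The set [+\d] inside the lookbehind is our own choice: it also rejects positions preceded by a digit, so that a match cannot start in the middle of a number.

```python
import re

string = "August 23; +1789; April 14, +2009; September 12 1960; March 3, 1346, October 2, 1982"

# Positive lookbehind: years preceded by "+" (the plus sign itself is not returned)
with_plus = re.findall(r"(?<=\+)\d{4}", string)
print(with_plus)     # ['1789', '2009']

# Negative lookbehind: years NOT preceded by "+" (or by another digit)
without_plus = re.findall(r"(?<![+\d])\d{4}", string)
print(without_plus)  # ['1960', '1346', '1982']
```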

### Illustrate lookaround "not capturing" matches

Write a pattern to catch any character followed by "ba". The code should return ['c', 'a'].

In [None]:
text= "cbaba"

#pattern = "(.)ba" # This consumes "ba" every time it matches (even though the group returns only the first character), so it finds only ['c']
# The lookahead version catches both c and a, as it does not consume the "ba"
pattern = ".(?=ba)" 

print(re.findall(pattern, text))

['c', 'a']


In [None]:
import codecs
import re
import sys

#file_lines = codecs.open("./fradulent_emails_utf8.txt", "r").readlines()

lines = ['From: "Dr. Bernard Makelele" <benkelele@phantomemail.com>', 'From: "Dr.Sam jordan" <sjordan@diplomats.com>', 
           'From: "DR MARIAM ABACHA" <elixwilliam@usa.com>', 'From: "bell.idr bell.idr" <bbell.idr@caramail.com>']

# Add your code here


## Other Regex Functions
So far we have been using __re.findall__ to retrieve all possible matches.
But this is not always what we want to do.
There are other regex functions we can use:

<br>

#### **re.search(pattern, string, flags=0)**
Scan through string looking for the __first location where the regular expression pattern produces a match__, and return a corresponding match object. Return None if no position in the string matches the pattern.

<br>

#### **re.match(pattern, string, flags=0)** 
If zero or more characters __at the beginning of string__ match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern.




In [None]:
a = "123abc"
print(re.match("[a-z]+",a))


None


If you want to locate a match anywhere in string, use __search()__ instead.

In [None]:
a = "123abc"
print(re.search("[a-z]+",a))
print(re.search("[a-z]+",a).group())

<re.Match object; span=(3, 6), match='abc'>
abc
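Since search() returns None when nothing matches, calling .group() directly on the result raises an AttributeError in that case. A minimal sketch of a guarded version follows; the helper name first_letters is hypothetical.

```python
import re

def first_letters(text):
    """Return the first run of lowercase letters in text, or None if there is none."""
    match = re.search(r"[a-z]+", text)
    if match is None:       # no match anywhere in the string
        return None
    return match.group()    # the matched text

print(first_letters("123abc"))  # abc
print(first_letters("123"))     # None

# A match object also knows where the match was found
span = re.search(r"[a-z]+", "123abc").span()
print(span)                     # (3, 6)
```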


#### __re.split(pattern, string, maxsplit=0, flags=0)__
Split string by the occurrences of pattern resulting in a list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.


In [None]:
# Examples with "split"
string = 'Words,   words, words.'
result1 = re.split('words,', string) 
result2 = re.split(r'\s', string) 
result3 = re.split(r'\s+', string)
result4 = re.split(',', string) 
result5 = re.split('(words,)', string) # Capturing groups will appear in the output

# What should be the results if we use these patterns to match the given "string"?
### Print the results.

print(result1)
print(result2)
print(result3)
print(result4)
print(result5)

['Words,   ', ' words.']
['Words,', '', '', 'words,', 'words.']
['Words,', 'words,', 'words.']
['Words', '   words', ' words.']
['Words,   ', 'words,', ' words.']
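The maxsplit argument is not used in the cell above, so here is a short sketch of it on the same string. The variable names are ours.

```python
import re

string = 'Words,   words, words.'

# maxsplit=1: split only once; the rest of the string stays in the last element
first_split = re.split(r'\s+', string, maxsplit=1)
print(first_split)  # ['Words,', 'words, words.']

# Splitting on a comma plus any following whitespace removes both at once
clean_split = re.split(r',\s*', string)
print(clean_split)  # ['Words', 'words', 'words.']
```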


#### __re.sub(pattern, replace, string, count=0, flags=0)__
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement "replace". If the pattern isn’t found, string is returned unchanged.



In [None]:
# Example with "sub"
string = 'Words, words, words.*'
result0 = re.sub(',', '', string)           # Delete all the commas
result1 = re.sub(',', '*', string)           # Replace commas with * (the replacement is a string)
result2 = re.sub(r'\s+', '_', string)           # Replace each sequence of whitespace with an underscore
result3 = re.sub('w(ord)s', r'\1', string)    # Replace a string with a part of itself. \1 refers to the first group; the r prefix (raw string) keeps the backslash intact
result4 = re.sub('w(ord)s', r'\1', string, re.IGNORECASE)    # This will not ignore case: the flag is passed positionally, so it ends up in "count" and not in "flags"
result5 = re.sub('w(ord)s', r'\1', string, flags=re.IGNORECASE)    # Replace a string with a part of itself, case insensitive

print(result0)
print(result1)
print(result2)
print(result3)
print(result4)
print(result5)

Words words words.*
Words* words* words.*
Words,_words,_words.*
Words, ord, ord.*
Words, ord, ord.*
ord, ord, ord.*
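To limit the number of replacements on purpose, pass count as a keyword argument, which avoids the mix-up seen in result4. A minimal sketch on the same string (the variable names are ours):

```python
import re

string = 'Words, words, words.*'

# count=1: replace only the leftmost occurrence
one_comma = re.sub(',', '', string, count=1)
print(one_comma)  # Words words, words.*

# count and flags together, both passed by keyword
one_word = re.sub('w(ord)s', r'\1', string, count=1, flags=re.IGNORECASE)
print(one_word)   # ord, words, words.*
```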


#### __re.compile(pattern, flags=0)__
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods. When should we use this? 
- reusability
- somewhat increased speed if used repeatedly (in a loop, for example); the re module also caches recently used patterns, so the gain is usually small

In [15]:
pattern = re.compile('hello')
result = re.findall(pattern, 'hello world')
print(result)

result = re.findall('hello', 'hello world')
print(result)

['hello']
['hello']
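A compiled pattern also has its own findall(), search(), etc. methods, which is convenient when the same pattern is applied many times. A small sketch with made-up lines:

```python
import re

# Compile once ...
year_pattern = re.compile(r"\d{4}")

lines = ["August 23, 1789", "no year here", "April 14, 2009 and 1960"]

# ... and reuse the pattern object for every line
years_per_line = [year_pattern.findall(line) for line in lines]
print(years_per_line)  # [['1789'], [], ['2009', '1960']]
```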


This course contains material from:

- dataquest.io
- https://www.regular-expressions.info/lookaround.html
- https://docs.python.org/3/library/re.html
- https://www.bogotobogo.com/python/python_regularExpressions.php