Session: Spring 2020

Author      : Mayuresh Anand 

Last updated: April 22 2020 

# Regular Expression

* [Wiki Link](https://en.wikipedia.org/wiki/Regular_expression)
* [Google For Education](https://developers.google.com/edu/python/regular-expressions)
* [W3School Link](https://www.w3schools.com/python/python_regex.asp)

[R-Markdown Link](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables)

A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that defines a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

**RegEx Functions**

The `re` module offers a set of functions that allow us to search a string for a match:

| Function      | Description                                                            |
| :------------- | :-------------------------------------------------------------------- |
| findall       | Returns a list containing all matches                                  |
| search        | Returns a Match object if there is a match anywhere in the string      |
| split         | Returns a list where the string has been split at each match           |
| sub           | Replaces one or many matches with a string                             |
| match         | Returns a Match object only if the pattern matches at the beginning of the string |
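
As a rough illustration of the five functions in the table, the sketch below runs each one against the same sample sentence. The sentence and the variable names are assumptions chosen for this example, not part of the original notes.

```python
import re

text = "The rain in Spain"  # sample sentence, assumed for illustration

# findall: every non-overlapping match, returned as a list of strings
all_matches = re.findall(r"ai", text)
print(all_matches)            # ['ai', 'ai']

# search: the first match anywhere in the string, or None
first = re.search(r"Spain", text)
print(first.start())          # 12

# split: cut the string at each match (here, at each whitespace character)
parts = re.split(r"\s", text)
print(parts)                  # ['The', 'rain', 'in', 'Spain']

# sub: replace every match with another string
replaced = re.sub(r"\s", "_", text)
print(replaced)               # The_rain_in_Spain

# match: succeeds only if the pattern matches at the very beginning
at_start = re.match(r"The", text)
not_at_start = re.match(r"rain", text)
print(at_start is not None, not_at_start)   # True None
```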

Ordinary characters match themselves exactly, but metacharacters do not, because they have a special meaning.

**Metacharacters**

Metacharacters are characters with a special meaning:

|Character | Description                                                               |Example       |
|:-------- |:--------------------------------------------------------------------------|:------------ |
|[]        |A set of characters                                                        |"[a-m]"       |
|\\        |Signals a special sequence (can also be used to escape special characters) |"\d"          |
|.         |Any character (except newline character)                                   |"he..o"       |
|^         |Starts with                                                                |"^hello"      |
|\$         |Ends with                                                                  |"world$"      |
|*         |Zero or more occurrences                                                   |"aix*"        |
|+         |One or more occurrences                                                    |"aix+"        |
|{}        |Exactly the specified number of occurrences                                |"al{2}"       |
|\|        |Either or                                                                  |"falls\|stays" |
|()        |Capture and group                                                          |              |
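
The table's example patterns can be tried out with `re.findall`. This is only a sketch: the test strings are made up for illustration, and the last pattern is an assumed example for `()`, which the table leaves blank.

```python
import re

# Each call pairs a pattern from the table with an invented test string.
set_hits = re.findall(r"[a-m]", "hello")                   # letters between a and m
dot_hits = re.findall(r"he..o", "hello world")             # . = any single character
start_hits = re.findall(r"^hello", "hello world")          # ^ = starts with
end_hits = re.findall(r"world$", "hello world")            # $ = ends with
star_hits = re.findall(r"aix*", "ai aix aixxx")            # * = zero or more x
plus_hits = re.findall(r"aix+", "ai aix aixxx")            # + = one or more x
exact_hits = re.findall(r"al{2}", "al all alll")           # {2} = exactly two l
either_hits = re.findall(r"falls|stays", "it falls or stays")  # | = either or
group_hits = re.findall(r"(\d+)-(\d+)", "10-20")           # () = capture groups

print(set_hits)     # ['h', 'e', 'l', 'l']
print(dot_hits)     # ['hello']
print(start_hits)   # ['hello']
print(end_hits)     # ['world']
print(star_hits)    # ['ai', 'aix', 'aixxx']
print(plus_hits)    # ['aix', 'aixxx']
print(exact_hits)   # ['all', 'all']
print(either_hits)  # ['falls', 'stays']
print(group_hits)   # [('10', '20')]
```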

**Special Sequences**

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

|Character |	Description                                                                     |	Example 	|
|:---------|:-----------------------------------------------------------------------------------|:--------------|
|\A        | 	Returns a match if the specified characters are at the beginning of the string 	|"\AThe"| 	
|\b        | 	Returns a match where the specified characters are at the beginning or at the end of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string") 	|r"\bain" r"ain\b"| 
|\B        | 	Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string") 	|r"\Bain" r"ain\B" |	
|\d        | Returns a match for any digit (0-9) |"\d" |
|\D        | Returns a match for any character that is NOT a digit |"\D" |
|\s        | Returns a match for any white space character |"\s" |
|\S        | Returns a match for any character that is NOT a white space character |"\S" |
|\w        | Returns a match for any word character (letters a-z and A-Z, digits 0-9, and the underscore _ character) |"\w" |
|\W        | Returns a match for any character that is NOT a word character |"\W" |
|\Z        | Returns a match if the specified characters are at the end of the string |"Spain\Z" |
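
A short sketch of several special sequences in action. The sample sentence is an assumption made for this example; raw strings (`r"..."`) are used throughout so that the backslashes reach the `re` module unchanged.

```python
import re

text = "The rain in Spain costs 42 euros"  # sample sentence, assumed for illustration

digits = re.findall(r"\d", text)             # each digit separately
word_start = re.findall(r"\bain", text)      # 'ain' at the start of a word: none here
word_end = re.findall(r"ain\b", text)        # 'ain' at the end of a word: rain, Spain
inside_word = re.findall(r"\Bain", text)     # 'ain' NOT at the start of a word
starts_with = re.search(r"\AThe", text)      # anchored at the start of the string
ends_with = re.search(r"euros\Z", text)      # anchored at the end of the string
words = re.findall(r"\w+", "a_1 b-2")        # '-' is not a word character
spaces = re.findall(r"\s", "a b\tc")         # a space and a tab

print(digits)        # ['4', '2']
print(word_start)    # []
print(word_end)      # ['ain', 'ain']
print(inside_word)   # ['ain', 'ain']
print(starts_with is not None, ends_with is not None)  # True True
print(words)         # ['a_1', 'b', '2']
print(spaces)        # [' ', '\t']
```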

**Sets**

A set is a set of characters inside a pair of square brackets [] with a special meaning:

|Set        | 	Description                                                                     |	
|:----------|:----------------------------------------------------------------------------------|
|[arn]      |   Returns a match where one of the specified characters (a, r, or n) is present 	|
|[a-n] 	    |   Returns a match for any lower case character, alphabetically between a and n 	|
|[^arn]     |	Returns a match for any character EXCEPT a, r, and n 	                        |
|[0123]     |	Returns a match where any of the specified digits (0, 1, 2, or 3) are present 	|
|[0-9] 	    |   Returns a match for any digit between 0 and 9 	                                |
|[0-5][0-9] |	Returns a match for any two-digit number from 00 to 59 	                    |
|[a-zA-Z]   |	Returns a match for any character alphabetically between a and z, lower case OR upper case 	|
|[+]        |	In sets, +, *, ., \|, (), $, \{\} have no special meaning, so \[+\] means: return a match for any + character in the string|
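
The sets above can be checked with a few `re.findall` calls. This is a minimal sketch; the input strings are invented for the example.

```python
import re

any_of = re.findall(r"[arn]", "train")               # only a, r, or n
none_of = re.findall(r"[^arn]", "train")             # everything except a, r, n
minutes = re.findall(r"[0-5][0-9]", "08:45 and 99")  # two-digit numbers 00 to 59
letters = re.findall(r"[a-zA-Z]", "a1B2")            # lower or upper case letters
plus_signs = re.findall(r"[+]", "1 + 2 + 3")         # + is literal inside a set

print(any_of)      # ['r', 'a', 'n']
print(none_of)     # ['t', 'i']
print(minutes)     # ['08', '45']
print(letters)     # ['a', 'B']
print(plus_signs)  # ['+', '+']
```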

# Python re module

We are going to look at the following functions:
* match
* search
* findall 

## **re.match(pattern, string)**

The match function is used for finding matches at **the beginning of a string only!**

Even if you're dealing with a multiline string and include a "^" to try to search at the beginning and use the re.MULTILINE flag, it will still only search the beginning of the string.
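
The behaviour described above can be checked with a two-line string. This is a small sketch; the sample text is an assumption made for the example.

```python
import re

text = "first line\nsecond line"  # sample text, assumed for illustration

# re.match only ever tries position 0, even with re.MULTILINE
m_match = re.match(r"^second", text, re.MULTILINE)

# re.search with re.MULTILINE lets ^ match right after the newline
m_search_multi = re.search(r"^second", text, re.MULTILINE)

# without the flag, ^ matches only at the very start of the string
m_search_plain = re.search(r"^second", text)

print(m_match)                 # None
print(m_search_multi.group())  # second
print(m_search_plain)          # None
```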

A great use case for re.match is testing a single pattern like a phone number or zip code. It's a good way to tell if your test string matches a desired pattern. The cells below start with a simple literal pattern and finish with a quick check that a string matches a desired phone number format.

In [3]:
import re

pattern = r'hello'
strng = 'hello world'
result = re.match(pattern, strng)

print(result.group())

hello


In [4]:
result = re.match(r'world', 'hello world')
if result:
    print(result.group())  # only reached when a match is found; calling .group() on None would raise AttributeError
else:
    print("None object returned")

None object returned


In [6]:
pattern = r'(\d{3})-(\d{3})-(\d{4})'

if re.match(pattern, '925-783-3005'):
    print("phone number is good")

# If the string matches, a match object will be returned; otherwise it will return None.

phone number is good
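
One caveat worth sketching: `re.match` anchors only the start of the string, so extra trailing characters still pass the check above. The standard-library function `re.fullmatch` (Python 3.4+) requires the whole string to match. The over-long number below is an invented test value.

```python
import re

pattern = r'(\d{3})-(\d{3})-(\d{4})'

too_long = '925-783-30059999'   # invented input with extra trailing digits

# match() accepts it, because the pattern matches at the beginning
loose = re.match(pattern, too_long)

# fullmatch() rejects it, because the pattern must cover the entire string
strict = re.fullmatch(pattern, too_long)

# the well-formed number passes the strict check
good = re.fullmatch(pattern, '925-783-3005')

print(loose is not None)   # True
print(strict)              # None
print(good.group())        # 925-783-3005
```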


## **re.search(pattern, string)**

This is similar to the match function but it searches the whole string and returns only the first match. *This does not return more than one match*. If the search is successful, search() **returns a match object** or **None otherwise**.

In [14]:
#strng1 = "word:cat!!"

strng2 = 'an example word:cat!! word:Art'

pattern = r'word:\w\w\w'

result = re.search(pattern, strng2)

# If-statement after search() tests if it succeeded
if result:
    print('found', result.group())  ## 'found word:cat'
else:
    print('did not find')

found word:cat


In [12]:
print(result.group())

word:cat


### Repetition

Things get more interesting when you use +, * and ? to specify repetition in the pattern:

  *  \+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
  *  \* -- 0 or more occurrences of the pattern to its left
  *   ? -- match 0 or 1 occurrences of the pattern to its left 
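
A small sketch of the `?` quantifier from the list above, which makes the item to its left optional. The test strings are invented for this example.

```python
import re

# u? = zero or one 'u', so both spellings match but 'colouur' does not
spellings = re.findall(r"colou?r", "color colour colouur")

# -? = an optional hyphen between two digits
pairs = re.findall(r"\d-?\d", "4-2 42")

print(spellings)  # ['color', 'colour']
print(pairs)      # ['4-2', '42']
```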

In [18]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
print(1,match.group())

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
print(2,match.group())

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
print(3, match.group())

match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
print(4, match.group())

match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
print(5, match.group())

## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
print(6, "None - so match.group() will give error")

## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
print(7, match.group())


1 piii
2 ii
3 1 2   3
4 12  3
5 123
6 None - so match.group() will give error
7 bar


In [15]:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
print(match)

None


In [17]:
match = re.search(r'^b\w+', 'boobar') # found, match.group() == "boobar"
print(match.group())

boobar


### Email Example

**First attempt: `\w+@\w+` finds only part of the email**

In [21]:
strng = 'purple alice-b@google.com monkey dishwasher' # the email uses characters from [\w.-]

In [24]:
pattern = r'\w+@\w+' 

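# \w matches one word character (letter, digit or underscore); \w+ matches one or more,
# so the match cannot extend across the '-' or the '.'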

match = re.search(pattern, strng)
print(match)
if match:
    print(match.group())  ## 'b@google'

<re.Match object; span=(13, 21), match='b@google'>
b@google


**Finding full email from string**

In [25]:
match = re.search(r'[\w.-]+@[\w.-]+', strng) # allow word characters, '-' and '.' on both sides of the @
if match:
    print(match.group())  ## 'alice-b@google.com'

alice-b@google.com


**Finding email from string in the form of username and host by grouping them using ()**

In [27]:
#strng = 'purple alice-b@google.com abc@google.com monkey dishwasher' 
# The string above shows that search() returns only the first match, even when there are two emails

match = re.search(r'([\w.-]+)@([\w.-]+)', strng)  # no quantifier after the second group; it was redundant

if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)


alice-b@google.com
alice-b
google.com
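
As a complement to `group(1)` and `group(2)`, the match object also offers `groups()`, which returns all captured groups as a tuple, and `span()`, which gives the start and end positions of the whole match. The sketch below reuses the sample string from this section; the variable names are assumptions.

```python
import re

strng = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', strng)

if match:
    # groups() returns every captured group at once, in order
    username, host = match.groups()
    print(username, host)   # alice-b google.com

    # span() returns (start, end) indexes of the whole match in strng
    start, end = match.span()
    print(strng[start:end])  # alice-b@google.com
```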


## re.findall() 

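`re.findall()` finds **all** the matches and returns them as a list of strings, with each string representing one match. If the pattern contains two or more groups, it returns a list of tuples instead, with one tuple per match.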

In [28]:
strng = 'purple alice-b@google.com abc@yahoo.com test@microsoft.com monkey dishwasher'

match = re.findall(r'([\w.-]+)@([\w.-]+)', strng)

if match:
    print(match)   ## a list of (username, host) tuples, one per email found


[('alice-b', 'google.com'), ('abc', 'yahoo.com'), ('test', 'microsoft.com')]
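
The shape of the result depends on how many groups the pattern has. The following sketch compares zero, one, and two groups on the same sample string; the variable names are chosen for illustration.

```python
import re

strng = 'purple alice-b@google.com abc@yahoo.com test@microsoft.com monkey dishwasher'

# No groups: a list of whole-match strings
whole = re.findall(r'[\w.-]+@[\w.-]+', strng)

# One group: a list of strings holding just that group
hosts = re.findall(r'@([\w.-]+)', strng)

# Two groups: a list of tuples that can be unpacked in a loop
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', strng)

print(whole)  # ['alice-b@google.com', 'abc@yahoo.com', 'test@microsoft.com']
print(hosts)  # ['google.com', 'yahoo.com', 'microsoft.com']
for user, host in pairs:
    print(user, 'at', host)
```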


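### Reading from a file
* f = open(filename, mode) opens the file; mode is 'r' to read or 'w' to write
* f.close() closes the file
* f.read() returns the rest of the file as one string
* f.readline() returns the next line, including its trailing newline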
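
A minimal sketch of these four calls, plus the `with` form that closes the file automatically. It writes to a scratch file in a temporary directory so that nothing in the working folder is touched; the file name and contents are assumptions made for this example.

```python
import os
import tempfile

# Hypothetical scratch file; the name "demo.txt" is an assumption for this sketch
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, "w")            # "w" creates the file (or truncates an existing one)
f.write("line one\nline two\n")
f.close()                      # flush and release the file

f = open(path, "r")            # "r" opens for reading
first = f.readline()           # reads up to and including the first newline
rest = f.read()                # reads everything that is left
f.close()

# The with-statement closes the file even if an error occurs inside the block
with open(path, "r") as f:
    whole = f.read()

print(repr(first))  # 'line one\n'
print(repr(rest))   # 'line two\n'
print(repr(whole))  # 'line one\nline two\n'
```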

In [30]:
f = open("test.txt","w")
f.write("abc-1@a12.com\n")
f.write("abc-1a12.com\n")
f.write("abc-1@a12.com\n")
f.write("abc-1a12.com\n")
f.write("abg@a12.com\n")
f.write("abef@a12.com\n")
f.write("abcd@a12.com\n")
f.close()

In [31]:
# Open the file written in the previous cell; the with-block closes it automatically
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of (username, host) tuples
    strings = re.findall(r'([\w.-]+)@([\w.-]+)', f.read())
strings

[('abc-1', 'a12.com'),
 ('abc-1', 'a12.com'),
 ('abg', 'a12.com'),
 ('abef', 'a12.com'),
 ('abcd', 'a12.com')]

# Further reading
* [Cheatsheet](https://www.debuggex.com/cheatsheet/regex/python)
* [Official Documentation](https://docs.python.org/3/library/re.html)