# Regular Expressions

### What are regular expressions?
A regular expression (RegEx) is a sequence of characters that forms a search pattern. Regular expressions are used in many programming languages, text editors, and other tools to determine whether a string matches a specified pattern.

Common cases where regular expressions are used include:
- Search in search engines.
- Validation of the format of email addresses, phone numbers or passwords during registration in a web portal.
- Manipulation of textual data in data science projects.

### Regular Expressions in Python
In Python, regular expressions are supported by the **re module**. To use RegEx in a Python script, we first have to import the module, as shown below:

In [None]:
import re

This library provides several functions that make it possible to search a string for a match. Some of the most common ones include:
- **re.compile() -**
Compiles a regular expression pattern into a regular expression object, which can be used for matching. It is more efficient when the expression will be used several times in a single program.
- **re.search() -**
Scans through a string looking for the first location where the regular expression pattern produces a match, and returns a corresponding match object. Otherwise, it returns None.
- **re.findall() -**
Returns all matches of a pattern in a string as a list of strings.
- **re.finditer() -**
Returns an iterator that yields a match object for each match of a pattern in a string.
- **re.sub() -**
Returns the string obtained by replacing the occurrences of a pattern in the string with the replacement `repl`. If the pattern isn’t found, the string is returned unchanged.
- **re.split() -**
Splits a string at each occurrence of a specified pattern and returns the pieces as a list.

More functions can be found in the [Python documentation](https://docs.python.org/3/library/re.html).


### Below is a simple multi-line string on which we shall use regular expressions.

In [None]:
# Raw string, so the backslash on the metacharacter line is kept literally.
text_to_search = r'''This is just a simple line of text
abcdefghijklmnopqrstuvwxyz
abc
abcabcabc
abcabcabcabcabc
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I love cats.
.com
1234567890
La LaLa
. ^ $ * + ? { } [ ] \ | ( )
This marks the end'''

#### re.search()

We can perform a simple RegEx search using literal characters, as illustrated below:

In [None]:
match_object = re.search('ABC', text_to_search) #case sensitive
print (match_object)


This returns a match object for the first occurrence of "ABC". The span it reports can be used to slice the original string:

In [None]:
print (text_to_search[92:95])

#### Working with Match Objects

The match object has the following attributes and methods:

- **span() -** returns a tuple containing the start and end positions of the match.
- **string -** the string that was passed into the function.
- **group() -** returns the part of the string where there was a match.

In [None]:
print (f"{match_object}\n")
print (f"The start and end indices: {match_object.span()}\n")
print ("The pattern matched: " + match_object.group() + "\n")
print ("\tThe entire string: \n ___________________________________\n" + match_object.string)
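
Because `re.search()` returns `None` when nothing matches, calling `.group()` on the result without a check can raise an `AttributeError`. The following is a minimal sketch of guarding against that. It uses a short stand-in string (`sample`, an assumption made so the example runs on its own) instead of `text_to_search`.

```python
import re

# Hypothetical stand-in text so the example is self-contained.
sample = "abc\nABCDEF\nI love cats."

found = re.search("ABC", sample)    # pattern is present
missing = re.search("xyz", sample)  # pattern is absent, so this is None

for label, m in (("ABC", found), ("xyz", missing)):
    if m is None:
        print(f"{label!r}: no match")
    else:
        print(f"{label!r}: matched {m.group()!r} at {m.span()}")
```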


#### re.compile()

In [None]:
pattern = re.compile(r'abc')
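
A compiled pattern offers the same operations as the module-level functions (`search`, `findall`, `finditer`, `sub`, `split`) as methods, so the pattern is written once and reused. Here is a small sketch of that idea. The names `sample` and `abc_pattern` are assumptions chosen for illustration.

```python
import re

# Hypothetical sample text for illustration.
sample = "abc abcabc ABC"

# Compile once, then reuse the same pattern object for several operations.
abc_pattern = re.compile(r"abc")

first = abc_pattern.search(sample)         # first match only
all_matches = abc_pattern.findall(sample)  # every match, as strings
replaced = abc_pattern.sub("-", sample)    # replace every match

print(first.span())   # (0, 3)
print(all_matches)    # ['abc', 'abc', 'abc']
print(replaced)       # - -- ABC
```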

#### re.findall()

In [None]:
letters = pattern.findall(text_to_search)  # returns all occurrences as a list of strings
for match in letters:
    print (match)

#### re.finditer()

In [None]:
letters_object = pattern.finditer(text_to_search)  # yields a match object for each occurrence
for match in letters_object:  # avoid the name "object", which shadows a built-in
    print(match)

#### re.sub()

In [None]:
substitute = re.sub("cats","dogs",text_to_search) #replace the word cats with dogs.
print(substitute)
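
As a quick illustration of how `re.sub()` behaves, the sketch below replaces every occurrence, then only the first one (using the `count` argument), and finally shows that the string comes back unchanged when the pattern is not found. The sentence in `line` is made up for the example.

```python
import re

# Hypothetical sentence for illustration.
line = "I love cats. cats love me."

replace_all = re.sub("cats", "dogs", line)             # every occurrence
replace_first = re.sub("cats", "dogs", line, count=1)  # only the first occurrence
no_match = re.sub("birds", "dogs", line)               # pattern absent: unchanged

print(replace_all)    # I love dogs. dogs love me.
print(replace_first)  # I love dogs. cats love me.
print(no_match)       # I love cats. cats love me.
```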

#### re.split()

In [None]:
split_string = re.split(r"\s", text_to_search)  # split the string at each whitespace character (raw string keeps \s intact)
print(split_string)
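
Note that `\s` matches a single whitespace character, so two consecutive spaces produce an empty string in the result. A short sketch of the difference between splitting on `\s` and on `\s+`, using a made-up `line`:

```python
import re

# Hypothetical input with two spaces between the words.
line = "La  LaLa"

each_whitespace = re.split(r"\s", line)   # splits at every single whitespace character
whitespace_runs = re.split(r"\s+", line)  # treats a run of whitespace as one separator

print(each_whitespace)   # ['La', '', 'LaLa']
print(whitespace_runs)   # ['La', 'LaLa']
```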

### Metacharacters
These are characters that are interpreted in a special way by a RegEx engine.

##### \\ - Escapes special characters or introduces a special sequence.

In [None]:
periods = re.finditer(r"\.",text_to_search)
for period in periods:
    print(period)


##### . - Matches any character except line terminators like \n (new line).


In [None]:
period_char = re.findall(r".",text_to_search)
print(period_char)


##### ^ - Matches at the start of a string. Checks if a string starts with a given pattern.

In [None]:
start_string = re.search(r"^This", text_to_search)
print (start_string)

##### \$ - Matches at the end of a string. Checks if a string ends with a given pattern.

In [None]:
end_string = re.search("end$", text_to_search)
print (end_string)

In [None]:
print(text_to_search)
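
By default, `^` and `$` anchor to the start and end of the whole string, not of each line. Passing the `re.MULTILINE` flag makes them match at the start and end of every line as well. The sketch below shows the difference on a short made-up string (`lines`).

```python
import re

# Hypothetical three-line string for illustration.
lines = "This is line one\nThis marks the end\nabc"

# Without the flag, ^ only matches at the very start of the string.
default_starts = re.findall(r"^This", lines)
# With re.MULTILINE, ^ also matches right after each newline.
multiline_starts = re.findall(r"^This", lines, flags=re.MULTILINE)

# The string ends with "abc", so "end$" only matches in multiline mode.
default_end = re.search(r"end$", lines)
multiline_end = re.search(r"end$", lines, flags=re.MULTILINE)

print(default_starts)    # ['This']
print(multiline_starts)  # ['This', 'This']
print(default_end)       # None
print(multiline_end.group())  # end
```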

##### ? - Matches zero or one occurrence of the pattern to its left.

In [None]:
phrase = "pan paan in Japan and paaan or pn is not a pun "

qmark_re = re.finditer(r"pa?n", phrase)
for qmark in qmark_re:
    print (qmark)

##### * - Matches zero or more occurrences of the pattern to its left.

In [None]:
phrase = "pan paan in Japan and paaan or pn is not a pun "

star_re = re.finditer(r"pa*n", phrase)
for star in star_re:
    print (star)

##### + - Matches one or more occurrences of the pattern to its left.

In [None]:
phrase = "pan paan in Japan and paaan or pn is not a pun "

plus_re = re.finditer(r"pa+n", phrase)
for p in plus_re:
    print (p)
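
To compare the three quantifiers side by side, the sketch below runs `?`, `*` and `+` over the same phrase and collects the matched text with `re.findall()`. The variable names are chosen for illustration only.

```python
import re

phrase = "pan paan in Japan and paaan or pn is not a pun "

zero_or_one = re.findall(r"pa?n", phrase)   # "p", at most one "a", then "n"
zero_or_more = re.findall(r"pa*n", phrase)  # "p", any number of "a", then "n"
one_or_more = re.findall(r"pa+n", phrase)   # "p", at least one "a", then "n"

print(zero_or_one)    # ['pan', 'pan', 'pn']
print(zero_or_more)   # ['pan', 'paan', 'pan', 'paaan', 'pn']
print(one_or_more)    # ['pan', 'paan', 'pan', 'paaan']
```

The second `'pan'` in each list comes from inside "Japan", and "pun" never matches because `u` is not `a`.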

##### {n} - Matches exactly n occurrences of the preceding expression

In [None]:
alphabet = re.finditer(r"(abc){5}", text_to_search)
for a in alphabet:
    print (a)


##### {n,} - Matches n or more occurrences of the preceding expression

In [None]:
alphabet = re.finditer(r"(abc){1,}", text_to_search)
for a in alphabet:
    print (a)

##### {n,m} - Matches at least n and at most m occurrences of the preceding expression (written without a space after the comma)

In [None]:
alphabet = re.finditer(r"(abc){2,4}", text_to_search)
for a in alphabet:
    print (a)

##### | - The or operator. Matches either the pattern on its left or the pattern on its right.

In [None]:
matches = re.finditer(r"ABCD|abcd", text_to_search)  # case sensitive
for match in matches:
    print(match)

##### () - Used to capture and group sub-patterns
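
A brief sketch of both uses of parentheses: capturing parts of a match so they can be read back with `group()`, and grouping a sub-pattern so a quantifier applies to all of it. The address in `contact` is a made-up example value.

```python
import re

# Hypothetical address, used only to illustrate capturing.
contact = "jane@example.com"

# Each pair of parentheses captures one part of the match.
m = re.search(r"(\w+)@(\w+)\.com", contact)
print(m.group(0))  # jane@example.com  (the whole match)
print(m.group(1))  # jane              (first group)
print(m.group(2))  # example           (second group)
print(m.groups())  # ('jane', 'example')

# Grouping lets + repeat the whole sub-pattern "abc", not just "c".
repeated = re.search(r"(abc)+", "abcabcabc xyz")
print(repeated.group(0))  # abcabcabc
```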

##### [...] - Matches any single character in the brackets.

##### [^...] - Matches any single character not in the brackets.
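
The two bracket forms can be sketched as follows, using a few lines borrowed from `text_to_search` as standalone strings. A range such as `1-5` inside the brackets stands for every character between its two ends.

```python
import re

# Any one of the listed characters.
vowels = re.findall(r"[aeiou]", "I love cats.")
# Any character that is NOT a lowercase letter.
not_lowercase = re.findall(r"[^a-z]", "La LaLa")
# A range inside the brackets: the digits 1 through 5.
one_to_five = re.findall(r"[1-5]", "1234567890")

print(vowels)         # ['o', 'e', 'a']
print(not_lowercase)  # ['L', ' ', 'L', 'L']
print(one_to_five)    # ['1', '2', '3', '4', '5']
```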

### Special Sequences

Special sequences are written as a \ followed by a specific character. They make commonly used patterns easier to write.

***It is advisable to use raw strings with special sequences, so that Python passes the backslash to the RegEx engine unchanged.***

In [None]:
print ("\n \t Newline")
print (r"\n \t Newline") #rawstring

##### \\w - Matches word characters: letters, digits and the underscore (_).

In [None]:
alphanumeric = re.findall(r"\w", text_to_search)
print(alphanumeric)

##### \\d - Matches digits (0-9).

In [None]:
digits = re.findall(r"\d", text_to_search)
print(digits)

##### \\s - Matches whitespace characters.

In [None]:
whitespaces = re.finditer(r"\s", text_to_search)
for whitespace in whitespaces:
    print(whitespace)

##### \\b - Matches the empty string at a word boundary, so the specified characters match only at the beginning or end of a word.

In [None]:
words = re.finditer(r"\bLa",text_to_search)
for word in words:
    print(word)

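***The uppercase versions do the opposite: \\W, \\D and \\S match any character that is not a word character, digit or whitespace character, and \\B matches only where there is no word boundary.***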
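
A small sketch of the uppercase sequences in action. The strings below are made up for illustration. For `\b` and `\B` the start index of each match is collected, so the two can be compared directly.

```python
import re

# Hypothetical sample text.
sample = "La LaLa 42!"

non_word = re.findall(r"\W", sample)     # not a letter, digit or underscore
non_digit = re.findall(r"\D", "a1 b2")   # not a digit
non_space = re.findall(r"\S", "a b\tc")  # not whitespace

# \bLa matches "La" at the start of a word, \BLa matches "La" inside a word.
at_boundary = [m.start() for m in re.finditer(r"\bLa", sample)]
inside_word = [m.start() for m in re.finditer(r"\BLa", sample)]

print(non_word)     # [' ', ' ', '!']
print(non_digit)    # ['a', ' ', 'b']
print(non_space)    # ['a', 'b', 'c']
print(at_boundary)  # [0, 3]
print(inside_word)  # [5]
```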

### Practical Examples
In these examples we shall retrieve information from a text file, *contacts.txt*.

##### Getting the phone numbers.

In [None]:
with open("contacts.txt") as f:
    # Kenyan numbers: optional "+", then 254 and nine digits, or 07 and eight digits
    phonenumbers = re.findall(r"[+]?254\d{9}|07\d{8}", f.read())
    print(f"Kenyan phone numbers: {phonenumbers}")


In [None]:
with open("contacts.txt") as f:
    # American numbers: optional +1 prefix, three-digit area code, then 3 and 4 digits
    numbers = re.findall(r"[+]?1?-?[(]?\d{3}[).-]\d{3}[.-]\d{4}", f.read())
    print("American phone numbers:")
    for n in numbers:
        print(n)

##### Getting the names of professors.

In [None]:
with open("contacts.txt") as f:
    # "Prof" or "prof", an optional literal period (escaped), optional whitespace, then a name
    names = re.findall(r"[Pp]rof\.?\s?[a-zA-Z]+", f.read())
    print(names)

##### Getting the email addresses.

In [None]:
with open("contacts.txt") as f:
    # local part, "@", domain name, a literal period, then a 2 to 6 letter suffix
    emails = re.findall(r"[a-zA-Z0-9_-]+@[a-zA-Z-]+\.[A-Za-z]{2,6}", f.read())
    for e in emails:
        print(e)
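
Since the contents of *contacts.txt* are not shown here, the patterns can also be tried on an in-memory string. The sketch below does that with the Kenyan phone number and email patterns from the cells above. Every name, number and address in `sample_contacts` is invented for illustration and is not taken from the file.

```python
import re

# Hypothetical stand-in for the contents of contacts.txt.
sample_contacts = (
    "Prof. Otieno: +254712345678, otieno@example.com\n"
    "Jane Doe: 0712345678, jane_doe@mail.org"
)

# Same patterns as used above on the file.
kenyan_numbers = re.findall(r"[+]?254\d{9}|07\d{8}", sample_contacts)
emails = re.findall(r"[a-zA-Z0-9_-]+@[a-zA-Z-]+\.[A-Za-z]{2,6}", sample_contacts)

print(kenyan_numbers)  # ['+254712345678', '0712345678']
print(emails)          # ['otieno@example.com', 'jane_doe@mail.org']
```

Testing a pattern on a small string like this makes it easier to see what it accepts and rejects before running it over a whole file.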