# **Regular Expressions**

A regular expression, or ***`RegEx`***, is a special text string that describes a search pattern. A RegEx can be used to check whether a pattern exists in a string. To use RegEx in Python, we first import the built-in module, which is called ***`re`***.

In [2]:
import re

### **Methods in the *`re`* module**

To find a pattern, we use the functions of the `re` module, each of which searches for a match in a string in a different way.

- re.`match`(): Checks for a match only at the beginning of the string and returns a match object if found, else returns `None`.
- re.`search`(): Returns a match object for the first match anywhere in the string, including multiline strings, else returns `None`.
- re.`findall`(): Returns a list containing all matches.
- re.`split`(): Takes a string, splits it at the match points, returns a list.
- re.`sub`(): Replaces one or many matches within a string.

#### **.match()**

In [3]:
txt = "I love to learn Python and Data Science"
match = re.match("I love to learn", txt, re.I) # re.I makes the match case-insensitive
match

<re.Match object; span=(0, 15), match='I love to learn'>

In [4]:
# We can get the starting and ending position of the match as tuple using .span()
span = match.span()
span

(0, 15)

In [5]:
start, end = span
print(start, end)

0 15


In [6]:
substring = txt[start:end]
substring

'I love to learn'

As you can see from the example above, the pattern we are looking for (or the substring we are looking for) is *`I love to learn`*. The match function returns an object only if the text starts with the pattern.

In [8]:
match_2 = re.match("I like to learn", txt, re.I)
print(match_2)

None


The string does not start with *`I like to learn`*, therefore there was no match and the match method returned `None`.
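Because a failed match is `None`, calling `.span()` or `.group()` on it raises an `AttributeError`. The sketch below shows one way to guard against that; the helper name `matched_prefix` is made up for this illustration and is not part of the `re` module.

```python
import re

txt = "I love to learn Python and Data Science"

def matched_prefix(pattern, text):
    """Hypothetical helper: return the matched text, or None if text does not start with pattern."""
    m = re.match(pattern, text, re.I)
    # A failed match is None, and None has no .span() or .group(),
    # so test the result before using it.
    return m.group() if m else None

print(matched_prefix("I love to learn", txt))  # I love to learn
print(matched_prefix("I like to learn", txt))  # None
```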

#### **.search()**

In [24]:
txt = """
Python is the most beatiful programming language that a human being has ever created.
The language python is recommended as a first one for learning.
"""
match = re.search("first", txt, re.I)
match

<re.Match object; span=(127, 132), match='first'>

In [15]:
span = match.span()
span

(127, 132)

In [16]:
start, end = span
substring = txt[start:end]
print(substring)

first


As observed, *`.search()`* is more flexible than *`.match()`* because it looks for the pattern throughout the text. It returns a match object for the first match it finds, otherwise it returns `None`. To get every match rather than only the first one, use *`.findall()`*. This function checks for the pattern through the whole string and returns all the matches as a list.
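The difference between the two functions is easiest to see on a string with more than one line. This is a small sketch using a made-up sample string `text`; note that `re.match` stays anchored to the start of the string even when the `re.MULTILINE` flag is given.

```python
import re

text = "first line\nsecond line"

# re.match only tries position 0, so a word on the second line is not found,
# even when the re.MULTILINE flag is set.
at_start = re.match("second", text, re.MULTILINE)

# re.search scans the whole string, including later lines.
anywhere = re.search("second", text)

print(at_start)         # None
print(anywhere.span())  # (11, 17)
```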

#### **.findall()**

In [17]:
matches = re.findall("language", txt, re.I)
matches

['language', 'language']

In [18]:
matches = re.findall("python", txt, re.I)
matches

['Python', 'python']

Since we are using *`re.I`*, both lowercase and uppercase letters are matched. Without the *`re.I`* flag, we have to write our pattern differently.

In [19]:
matches = re.findall("Python|python", txt)
matches

['Python', 'python']

In [20]:
matches = re.findall("[Pp]ython", txt)
matches

['Python', 'python']

### **Replacing a substring**

In [25]:
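# The character set already covers both cases, so no flag is needed.
# Note: in re.sub the fourth positional argument is count, so flags must be passed as flags=...
match_replaced = re.sub("[Pp]ython", "JavaScript", txt)
# OR match_replaced = re.sub("Python|python", "JavaScript", txt)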
print(match_replaced)


JavaScript is the most beatiful programming language that a human being has ever created.
The language JavaScript is recommended as a first one for learning.
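`re.sub` also accepts the optional `count` and `flags` arguments. A short sketch on a made-up sentence, passing both by keyword so that a flag is never mistaken for a count:

```python
import re

txt = "Python is fun. The language python is popular."

# flags must be passed by keyword: the fourth positional parameter is count.
all_replaced = re.sub("python", "JavaScript", txt, flags=re.I)

# count limits how many matches are replaced (0, the default, means all).
first_only = re.sub("python", "JavaScript", txt, count=1, flags=re.I)

print(all_replaced)  # JavaScript is fun. The language JavaScript is popular.
print(first_only)    # JavaScript is fun. The language python is popular.
```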



In [26]:
txt = """%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing. 
T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs. 
D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher?"""

matches = re.sub("%", "", txt)
print(matches)

I am teacher and  I love teaching. 
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs. 
Does this motivate you to be a teacher?


### **Splitting text using RegEx split**

In [27]:
txt = """I am a teacher and I love teaching.
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs.
Does this motivate you to be a teacher?"""

print(re.split("\n", txt))

['I am a teacher and I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']
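Splitting on `"\n"` could also be done with `str.split`. `re.split` becomes more useful when the separator is itself a pattern. A minimal sketch, assuming a made-up `record` string with inconsistent delimiters:

```python
import re

record = "apple, banana;cherry ,  date"

# Split on a comma or semicolon, swallowing any spaces around it.
fields = re.split(r"\s*[,;]\s*", record)
print(fields)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit stops after the given number of splits; the rest stays in one piece.
head, rest = re.split(r"\s*[,;]\s*", record, maxsplit=1)
print(head)  # apple
print(rest)  # banana;cherry ,  date
```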


### **Writing RegEx patterns**

To declare a string variable we use single or double quotes. A RegEx pattern is usually written as a raw string, *`r''`*, so that backslashes reach the `re` module unchanged. The following pattern matches only the lowercase *`apple`*. To make it case-insensitive, we should either rewrite the pattern or add a flag.

In [30]:
regex_pattern = r'apple'
txt = "Apple and banana are fruits. An old cliché that says: 'an apple a day, a doctor away' has been replaced by 'a banana a day keeps the doctor far far away'."

matches = re.findall(regex_pattern, txt)
print(matches)

['apple']


In [31]:
# Make it case-insensitive by adding a flag
matches = re.findall(regex_pattern, txt, re.I)
print(matches)

['Apple', 'apple']


In [32]:
# Or we can use a character set
regex_pattern = r'[Aa]pple'
matches = re.findall(regex_pattern, txt)
print(matches)

['Apple', 'apple']
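The `r` prefix matters as soon as a pattern contains a backslash. The sketch below illustrates this with `\b` (a word boundary), which is not otherwise covered in this section and is used here only as an example.

```python
import re

# In a normal string literal "\n" is one character (a newline);
# in a raw string literal r"\n" is two characters: a backslash and an 'n'.
print(len("\n"), len(r"\n"))  # 1 2

# Both spellings below produce the same two-character pattern for "a digit".
print("\\d" == r"\d")  # True

# Without the r prefix, "\b" is a backspace character, not a word boundary.
txt = "pineapple and apple"
with_raw = re.findall(r"\bapple\b", txt)
without_raw = re.findall("\bapple\b", txt)
print(with_raw)     # ['apple']
print(without_raw)  # []
```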


### **RegEx Nomenclature**

- `[]`: A set of characters.
    - `[a-c]` means either a, b or c.
    - `[a-z]` means any letter from a to z.
    - `[A-Z]` means any letter from A to Z.
    - `[0-3]` means 0 or 1 or 2 or 3.
    - `[0-9]` means any digit from 0 to 9.
    - `[A-Za-z0-9]` means any single character that is a to z, A to Z or 0 to 9.
- `\`: Used to escape special characters or to introduce special sequences.
    - `\d` matches any digit (0 to 9).
    - `\D` matches any character that is not a digit.
- `.`: Any character except the newline character (`\n`).
- `^`: Starts with
    - `r'^substring'`. Eg: `r'^love'`, a sentence that starts with the word love.
    - `r'[^abc]'` means neither a, b nor c (inside a set, `^` means negation).
- `$`: Ends with
    - `r'substring$'`. Eg: `r'love$'`, a sentence that ends with the word love.
- `*`: Zero or more times
    - `r'[a]*'` means a is optional and can occur many times.
- `+`: One or more times
    - `r'[a]+'` means at least once (or more).
- `?`: Zero or one time
    - `r'[a]?'` means zero times or once.
- `{3}`: Exactly 3 repetitions of the preceding element.
- `{3,}`: At least 3 repetitions.
- `{3,8}`: Between 3 and 8 repetitions.
- `|`: Either or
    - `r'apple|banana'` means either apple or banana.
- `()`: Capture and group.
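The quantifiers in the list above can be compared side by side. This is a sketch only: `samples` and the helper `fully_matching` are made up for the illustration, and `re.fullmatch` (which requires the whole string to match) is used so that each result depends only on the number of repetitions.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa", "aaaaaaaaa"]  # 0, 1, 2, 3, 4 and 9 a's

def fully_matching(pattern):
    """Hypothetical helper: return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(fully_matching(r"a*"))      # every sample, including the empty string
print(fully_matching(r"a+"))      # every sample except the empty string
print(fully_matching(r"a?"))      # ['', 'a']
print(fully_matching(r"a{3}"))    # ['aaa']
print(fully_matching(r"a{3,}"))   # ['aaa', 'aaaa', 'aaaaaaaaa']
print(fully_matching(r"a{3,8}"))  # ['aaa', 'aaaa']
```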

### **Examples**

In [33]:
txt = "Apple and banana are fruits. An old cliché that says: 'an apple a day, a doctor away' has been replaced by 'a banana a day keeps the doctor far far away'."

regex_pattern = r'[Aa]pple|[Bb]anana' # Either apple or banana
matches = re.findall(regex_pattern, txt)
print(matches)

['Apple', 'banana', 'apple', 'banana']


In [34]:
regex_pattern = r'\d' # match digits
txt = "This regular expression example was made on December 6, 2019 and revised on July 8, 2021"
matches = re.findall(regex_pattern, txt)
print(matches)

['6', '2', '0', '1', '9', '8', '2', '0', '2', '1']


In [35]:
regex_pattern = r'\d+' # match digits, '+' one or more times.
matches = re.findall(regex_pattern, txt)
print(matches)

['6', '2019', '8', '2021']


In [36]:
regex_pattern = r'[a].' # 'a' followed by any single character except a newline
txt = "Apple and banana are fruits"
matches = re.findall(regex_pattern, txt)
print(matches)

['an', 'an', 'an', 'a ', 'ar']


In [37]:
regex_pattern = r'[a].+'
txt = "Apple and banana are fruits"
matches = re.findall(regex_pattern, txt)
print(matches)

['and banana are fruits']


In [38]:
regex_pattern = r'[a].*'
txt = "Apple and banana are fruits"
matches = re.findall(regex_pattern, txt)
print(matches)

['and banana are fruits']


In [39]:
txt = """I am not sure if there is a convention how to write the word e-mail.
Some people write it as email others may write it as Email or E-mail."""

regex_pattern = r'[Ee]-?mail' # '-' is optional
matches = re.findall(regex_pattern, txt)
print(matches)

['e-mail', 'email', 'Email', 'E-mail']


In [40]:
txt = "This regular expression example was made on December 6, 2019 and revised on July 8, 2021"
regex_pattern = r'\d{4}' # Exactly 4 times
matches = re.findall(regex_pattern, txt)
print(matches)

['2019', '2021']


In [41]:
regex_pattern = r'^This' # Starts with
matches = re.findall(regex_pattern, txt)
print(matches)

['This']


In [42]:
print(re.findall(r'[^A-Za-z ]+', txt)) # ^ inside a character set means negation: not A to Z, not a to z, not a space

['6,', '2019', '8,', '2021']
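The examples above do not use `()`, `$` or `\D`. A short sketch on the same sentence fills that gap; the date pattern is only an assumption about how dates are written in this particular sentence, not a general date parser.

```python
import re

txt = "This regular expression example was made on December 6, 2019 and revised on July 8, 2021"

# With groups, findall returns one tuple per match, with one item per group.
dates = re.findall(r"([A-Z][a-z]+) (\d{1,2}), (\d{4})", txt)
print(dates)  # [('December', '6', '2019'), ('July', '8', '2021')]

# $ anchors the pattern to the end of the string.
last_year = re.findall(r"\d{4}$", txt)
print(last_year)  # ['2021']

# \D+ matches runs of characters that are not digits.
non_digits = re.findall(r"\D+", "ab12cd3")
print(non_digits)  # ['ab', 'cd']
```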


### **Exercises**

- What is the most frequent word in the following paragraph?

***I love to learn. If you do not love to learn, what else can you love? I love Python. If you do not love something which can give you all the capabilities to create beatiful things, what else can you love?***

In [3]:
from collections import Counter

paragraph = "I love to learn. If you do not love to learn, what else can you love? I love Python. If you do not love something which can give you all the capabilities to create beatiful things, what else can you love?"

In [4]:
# Creating a list with separate words
words_list = re.findall(r'[a-zA-Z]+', paragraph)

# Apply Counter method from collections to calculate the frequency of the words and convert result into a dictionary
word_count = dict(Counter(words_list))

# Sort count in descending order
sorted_word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
sorted_word_count

[('love', 6),
 ('you', 5),
 ('to', 3),
 ('can', 3),
 ('I', 2),
 ('learn', 2),
 ('If', 2),
 ('do', 2),
 ('not', 2),
 ('what', 2),
 ('else', 2),
 ('Python', 1),
 ('something', 1),
 ('which', 1),
 ('give', 1),
 ('all', 1),
 ('the', 1),
 ('capabilities', 1),
 ('create', 1),
 ('beatiful', 1),
 ('things', 1)]
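If only the top of the ranking is needed, `Counter.most_common()` can replace the manual sort. A minimal sketch on the first sentences of the paragraph; lowercasing the text first is an extra step not taken in the solution above.

```python
import re
from collections import Counter

paragraph = "I love to learn. If you do not love to learn, what else can you love?"

# Lowercase first so that words such as 'If' and 'if' would be counted together.
words = re.findall(r"[a-z]+", paragraph.lower())

# most_common(n) returns the n highest counts, already sorted in descending order.
top = Counter(words).most_common(1)
print(top)  # [('love', 3)]
```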

- Extract the numbers from this whole text and find the distance between the two furthest particles.

***The position of some particles on the horizontal x-axis are -12, -4, -3 and -1 in the negative direction, 0 at origin, 4 and 8 in the positive direction.***

In [8]:
txt = "The position of some particles on the horizontal x-axis are -12, -4, -3 and -1 in the negative direction, 0 at origin, 4 and 8 in the positive direction."
matches = re.findall(r'-?\d+', txt)

# Convert matches into integers
positions = [int(i) for i in matches]

# Calculate distance between the two furthest positions (distance = max - min)
distance = max(positions) - min(positions)
print(f"The distance between the two furthest particle positions is: {distance} (random units)")

The distance between the two furthest particle positions is: 20 (random units)


- Write a pattern which identifies whether a string is a valid Python variable name.

|                Test              | Result |
| -------------------------------- | ------ |
| is_valid_variable('first_name')  | True   |
| is_valid_variable('first-name')  | False  |
| is_valid_variable('1first_name') | False  |
| is_valid_variable('firstname')   | True   |

In [24]:
def is_valid_variable(var):
    # First character: a letter or underscore; the rest: letters, digits or underscores
    match = re.match(r'^[A-Za-z_][A-Za-z0-9_]*$', var)
    return bool(match)

In [27]:
is_valid_variable("first_name")

True
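The four cases from the table can be checked in one loop and compared with Python's built-in `str.isidentifier()`. This is a sketch under stated assumptions: the helper name `is_valid_identifier` is made up, the pattern covers ASCII names only, and rejecting reserved keywords such as `class` is an extra rule not required by the exercise.

```python
import keyword
import re

# Assumption: ASCII-only names (Python itself also accepts many Unicode letters).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_valid_identifier(name):
    """Hypothetical helper: an ASCII identifier that is not a reserved keyword."""
    return bool(IDENTIFIER.fullmatch(name)) and not keyword.iskeyword(name)

for candidate in ["first_name", "first-name", "1first_name", "firstname", "class"]:
    # Print the regex-based result next to the built-in check.
    print(candidate, is_valid_identifier(candidate), candidate.isidentifier())
```

The two checks agree on the table's cases; they differ on `class`, which `str.isidentifier()` accepts because it does not look at keywords.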

- Clean the following text. After cleaning, count the three most frequent words in the string.

***%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;. There $is nothing; &as& mo@re rewarding as educa@ting &and& @emp%o@wering peo@ple. ;I found tea@ching m%o@re interesting tha@n any other %jo@bs. %Do@es thi%s mo@tivate yo@u to be a tea@cher!?***

In [40]:
txt = "%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;. There $is nothing; &as& mo@re rewarding as educa@ting &and& @emp%o@wering peo@ple. ;I found tea@ching m%o@re interesting tha@n any other %jo@bs. %Do@es thi%s mo@tivate yo@u to be a tea@cher!?"

# Remove the unwanted special characters
match = re.sub(r'[%@&#;$!]', "", txt)

# Extract words individually
words_list = re.findall(r'[A-Za-z]+', match)

# Count frequency of words
word_count = dict(Counter(words_list))

# Sort in descending order
sorted_word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
sorted_word_count[:3]

[('I', 3), ('a', 2), ('teacher', 2)]
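Listing every unwanted symbol works when the noise is known in advance. An alternative sketch uses a negated character set to keep only what is wanted. The choice of letters, whitespace and `. , ?` is an assumption made for this text, and it would also delete digits.

```python
import re

sample = "%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;."

# Keep letters, whitespace and . , ? and delete every other character.
cleaned = re.sub(r"[^A-Za-z\s.,?]", "", sample)
print(cleaned)  # I am a teacher, and I love teaching.
```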