## Regular Expressions


Here, the re module is used; its search function lets us find matches for a regular expression inside a string. First, a regular expression is defined, r"\[(\d+)\]", which matches one or more digits enclosed in square brackets.

Then, the code uses the re.search() function to search the string log for a match to the regular expression. The re.search() function returns a Match object if a match is found, or None if no match is found.

In this example, re.search() returns a Match object because the string log contains a match for the regular expression. Indexing the Match object (or calling its group() method) returns the captured groups from the match. In this case, the only captured group is the number, which is returned by the expression result[1].
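
The behavior described above can be sketched as follows. The variable names whole_match, first_group, and no_match are illustrative, and the second input string is a made-up example with no bracketed number, included only to show the None case.

```python
import re

log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"

result = re.search(regex, log)
whole_match = result[0]   # the entire matched text, brackets included
first_group = result[1]   # only the digits captured by (\d+); same as result.group(1)
print(whole_match, first_group)

# Hypothetical input with no digits in brackets: re.search() returns None
no_match = re.search(regex, "no process id in [this] line")
print(no_match)
```

Because re.search() can return None, indexing the result is only safe after checking that a match was found.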

## 1. Regex vs Index

In [1]:
# Before regex

# We want to extract the process ID (the 12345 inside the square brackets)
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
index = log.index("[") # position of the first opening square bracket
print(log[index+1:index+6]) # slice from index+1 (just after the bracket) up to, but not including, index+6

12345


In [4]:
# After regex
import re # import the regular expression module

log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log) # search the log for the first match of the regex
print(result[1]) # group 1: the digits captured by (\d+)

12345
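
As a rough sketch of why the regex version is more robust, the two approaches can be compared on a log line whose process ID is shorter than five digits. The helper names pid_by_index and pid_by_regex and the shortened log line are assumptions made for this illustration.

```python
import re

def pid_by_index(line):
    # Slicing approach: hard-codes the assumption that the ID has exactly five digits
    start = line.index("[")
    return line[start + 1:start + 6]

def pid_by_regex(line):
    # Regex approach: captures however many digits sit between the brackets
    match = re.search(r"\[(\d+)\]", line)
    return match[1] if match else None

# Hypothetical log line with a three-digit process ID
short_log = "July 31 07:51:48 mycomputer bad_process[123]: ERROR Performing package upgrade"
print(pid_by_index(short_log))  # includes the closing bracket and colon
print(pid_by_regex(short_log))  # only the digits
```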


## 2. Simple matching in Python

## Returns a Match object if there is a match anywhere in the string

In [5]:
import re
# look for the pattern aza inside the string plaza
result = re.search(r"aza", "plaza") # the r prefix makes this a raw string: Python does not interpret backslashes inside it
print(result)

# span is the (start, end) position of the matched substring inside the string; it varies with the input

<re.Match object; span=(2, 5), match='aza'>


In [24]:
import re
result = re.search(r"aza", "mall") # returns None if the pattern is not found in the string
print(result) 

None


## Starts with
If you use the circumflex symbol (also known as a caret symbol) ^ as the first character of your regex, it will match only if the pattern occurs at the start of the string. 

In [7]:
print(re.search(r"^x", "xenon")) # matches because the string starts with x

<re.Match object; span=(0, 1), match='x'>


## Ends with
Alternatively, if you use the dollar sign symbol $ at the end of a regex, it will match only if the pattern occurs at the end.

In [23]:
import re
result = re.search(r"aar$", "bazaar") # matches because the string ends with aar
print(result)

<re.Match object; span=(3, 6), match='aar'>


## Any character (except newline character)

In [29]:
import re
print(re.search(r"p.ng", "clapping")) # the dot matches exactly one character, so p.ng is p, any one character, then ng
print(re.search(r"p..n", "spoong")) # two dots match exactly two characters between p and n
print(re.search(r"a.e.i", "academia")) # the vowels a, e and i, with exactly one other character between each pair

<re.Match object; span=(4, 8), match='ping'>
<re.Match object; span=(1, 5), match='poon'>
<re.Match object; span=(2, 7), match='ademi'>


## Case-insensitive matching

In [25]:
import re
# match p, any single character, then ng
print(re.search(r"p.ng", "Pangaea", re.IGNORECASE)) # re.IGNORECASE ignores the difference between uppercase and lowercase

<re.Match object; span=(0, 4), match='Pang'>
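
For comparison, here is a small sketch of the same search without the flag, with the short alias re.I, and with the inline (?i) flag. The variable names are illustrative only.

```python
import re

# Without a flag, the lowercase p in the pattern does not match the uppercase P
case_sensitive = re.search(r"p.ng", "Pangaea")

# re.I is the short alias for re.IGNORECASE
with_alias = re.search(r"p.ng", "Pangaea", re.I)

# (?i) at the start of the pattern switches on case-insensitive matching inline
with_inline_flag = re.search(r"(?i)p.ng", "Pangaea")

print(case_sensitive)
print(with_alias)
print(with_inline_flag)
```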


## 3. Wildcards and character classes

## A set of characters

In [1]:
import re
print(re.search(r"[Pp]ython", "Python")) # [Pp] matches either an uppercase P or a lowercase p, followed by ython

<re.Match object; span=(0, 6), match='Python'>


## Character ranges
Character ranges can be used to match a single character against a set of possibilities. 

## Returns a match for any character alphabetically between a and z, lowercase OR uppercase

In [8]:
import re
print(re.search(r"[a-z]way", "The end of the highway")) # [a-z]way : any lowercase letter followed by "way"
print(re.search(r"[A-Z]way", "Sway")) # [A-Z]way : any uppercase letter followed by "way"
print(re.search(r"[A-Z]way", "What a way to go")) # returns None because there is no uppercase letter directly before "way" in this string

print(re.search("cloud[a-zA-Z0-9]", "cloudy")) # "cloud" followed by any lowercase letter, uppercase letter, or digit
print(re.search("cloud[a-zA-Z0-9]", "cloud9"))

<re.Match object; span=(18, 22), match='hway'>
<re.Match object; span=(0, 4), match='Sway'>
None
<re.Match object; span=(0, 6), match='cloudy'>
<re.Match object; span=(0, 6), match='cloud9'>


## A set of symbols

In [9]:
# check whether the text passed contains punctuation symbols
import re
def check_punctuation(text):
  result = re.search(r"[,.:;?!]", text) # looks for commas, periods, colons, semicolons, question marks, and exclamation points
  return result is not None

print(check_punctuation("This is a sentence that ends with a period.")) # True
print(check_punctuation("This is a sentence fragment without a period")) # False
print(check_punctuation("Aren't regular expressions awesome?")) # True
print(check_punctuation("Wow! We're really picking up some steam now!")) # True
print(check_punctuation("End of the line")) # False

True
False
True
True
False


## Returns a match for any character EXCEPT a-z and A-Z

In [8]:
# [^...] : matches any character that isn't in the set.

import re
print(re.search(r"[^a-zA-Z]", "This is a sentence with spaces.")) # [^a-zA-Z] : looks for any character that's not a letter
print(re.search(r"[^a-zA-Z ]", "This is a sentence with spaces.")) # [^a-zA-Z ] : looks for any character that's neither a letter nor a space



<re.Match object; span=(4, 5), match=' '>
<re.Match object; span=(30, 31), match='.'>


## 4. Repetition Qualifiers

## Zero or more occurrences

In [13]:
import re
# the star (*) matches zero or more occurrences and takes as many characters as possible (it is greedy)
print(re.search(r"Py.*n", "Pygmalion")) # matches everything from "Py" to the last "n"
print(re.search(r"Py.*n", "Python Programming")) # greedy: runs to the last "n", which is in "Programming"

print(re.search(r"Py[a-z]*n", "Python Programming")) # only lowercase letters are allowed between "Py" and "n", so the match stops before the space
print(re.search(r"Py[a-z]*n", "Pyn")) # zero letters between "Py" and "n" still matches

<re.Match object; span=(0, 9), match='Pygmalion'>
<re.Match object; span=(0, 17), match='Python Programmin'>
<re.Match object; span=(0, 6), match='Python'>
<re.Match object; span=(0, 3), match='Pyn'>


## One or more occurrences

In [14]:
import re
print(re.search(r"o+l+", "goldfish")) # + matches one or more occurrences: one or more o followed by one or more l
print(re.search(r"o+l+", "woolly")) # the same pattern also matches the repeated letters in "ooll"
print(re.search(r"o+l+", "boil")) # None because the o is not immediately followed by an l

<re.Match object; span=(1, 3), match='ol'>
<re.Match object; span=(1, 5), match='ooll'>
None


## Zero or one occurrences

In [12]:
import re
print(re.search(r"p?each", "To each their own")) # ? matches zero or one occurrence of the preceding character, here the p before "each"
print(re.search(r"p?each", "I like peaches"))

<re.Match object; span=(3, 7), match='each'>
<re.Match object; span=(7, 12), match='peach'>
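
To see the three qualifiers side by side, the following sketch applies *, + and ? to the same letter and compares what each one matches. The patterns and the results dictionary are chosen for this illustration and do not come from the cells above.

```python
import re

patterns = [r"o*l", r"o+l", r"o?l"]   # zero or more, one or more, zero or one "o" before "l"
words = ["woolly", "boil"]

results = {}
for word in words:
    for pattern in patterns:
        match = re.search(pattern, word)
        # store the matched text, or None when there is no match
        results[(pattern, word)] = match[0] if match else None
        print(f"{pattern!r:8} {word!r:10} {results[(pattern, word)]!r}")
```

In "boil" the o is not directly followed by an l, so only the patterns that allow zero occurrences of o find a match (the bare "l").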


## 5. Escaping characters
Special characters like dot, star, plus, question mark, circumflex, dollar sign, and square brackets have a special meaning in regular expressions, which lets a pattern match different kinds of strings.

If we want to match one of these special characters literally, we need to put an escape character, the backslash "\", in front of it.

For example, if we want to match a dot in a string, we would use \. instead of just a dot. Backslashes are also used as escape characters in ordinary Python strings (for example, "\n" for a newline).

Using raw strings helps avoid confusion because special characters are not interpreted when generating the string, only when parsing the regular expression.
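
The difference between raw and ordinary strings can be sketched as follows. The variable names are illustrative; the point is only that the r prefix passes the backslash to the regex engine unchanged.

```python
import re

# In a normal string, "\n" is one character (a newline); in a raw string it stays as two
print(len("\n"), len(r"\n"))

# Without the r prefix, the backslash itself has to be doubled to reach the regex engine
normal_pattern = "\\.com"
raw_pattern = r"\.com"
print(normal_pattern == raw_pattern)

# Both spellings produce the same pattern, so both find the literal ".com"
match_normal = re.search(normal_pattern, "mydomain.com")
match_raw = re.search(raw_pattern, "mydomain.com")
print(match_normal[0], match_raw[0])
```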

In [3]:
# match a literal dot
import re
print(re.search(r".com", "welcome")) # an unescaped dot matches any character, so this finds lcom
print(re.search(r"\.com", "welcome")) # None because there is no literal dot in the string
print(re.search(r"\.com", "mydomain.com")) # the backslash makes the dot match only an actual dot

<re.Match object; span=(2, 6), match='lcom'>
None
<re.Match object; span=(8, 12), match='.com'>


In [2]:
# match alphanumeric characters: \w matches letters, digits, and underscores
import re
print(re.search(r"\w*", "This is an example")) # matches as many alphanumeric characters as possible; stops at the space because a space is not alphanumeric
print(re.search(r"\w*", "And_this_is_another"))

<re.Match object; span=(0, 4), match='This'>
<re.Match object; span=(0, 19), match='And_this_is_another'>


In [None]:
# check if the text passed has at least 2 groups of alphanumeric characters (including letters, numbers, and underscores)
# separated by one or more whitespace characters.

import re
def check_character_groups(text):
  result = re.search(r"\w+\s+\w+", text) # \w+ is a group of alphanumeric characters; \s+ is the one or more whitespace characters separating the groups
  return result is not None

print(check_character_groups("One")) # False
print(check_character_groups("123  Ready Set GO")) # True
print(check_character_groups("username user_01")) # True
print(check_character_groups("shopping_list: milk, bread, eggs.")) # False

## 6. Regex in Action 

In [4]:
# check if a string starts with A and ends with a
import re
# A.*a: the dot matches any character, and the star repeats it, so .* covers everything between the A and the last a
print(re.search(r"A.*a", "Argentina"))
print(re.search(r"A.*a", "Azerbaijan")) # the final n is not part of the match, but the string still matches
print(re.search(r"^A.*a$", "Australia")) # ^ and $ anchor the pattern so the whole string must start with A and end with a

<re.Match object; span=(0, 9), match='Argentina'>
<re.Match object; span=(0, 9), match='Azerbaija'>
<re.Match object; span=(0, 9), match='Australia'>


Construct a pattern that validates whether a string is a valid variable name in Python.

It can contain any number of letters, numbers or underscores, but it can't start with a number.

In [7]:
import re
pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$" # first character: a letter or underscore; then any number of letters, digits, or underscores up to the end
print(re.search(pattern, "_this_is_a_valid_variable_name")) # can start with an underscore
print(re.search(pattern, "this isn't a valid variable")) # None because it contains spaces
print(re.search(pattern, "my_variable1")) # can end with a digit
print(re.search(pattern, "2my_variable1")) # None because it can't start with a number

<re.Match object; span=(0, 30), match='_this_is_a_valid_variable_name'>
None
<re.Match object; span=(0, 12), match='my_variable1'>
None
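
As a minimal sketch, the pattern above can be wrapped in a helper and compared with Python's built-in str.isidentifier(). The names NAME_PATTERN, is_valid_name, and candidates are hypothetical, introduced only for this illustration.

```python
import re

NAME_PATTERN = r"^[a-zA-Z_][a-zA-Z0-9_]*$"

def is_valid_name(text):
    """Hypothetical helper: True if text matches the variable-name pattern above."""
    return re.search(NAME_PATTERN, text) is not None

candidates = [
    "_this_is_a_valid_variable_name",
    "this isn't a valid variable",
    "my_variable1",
    "2my_variable1",
]

for name in candidates:
    # str.isidentifier() is Python's built-in check, shown for comparison
    print(name, is_valid_name(name), name.isidentifier())
```

Both checks agree on these four strings. Neither rejects reserved words such as class, and str.isidentifier() additionally accepts non-ASCII letters, which this pattern does not.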


Check if the text passed looks like a standard sentence, meaning that it starts with an uppercase letter, followed by any number of lowercase letters or spaces, and ends with a period, question mark, or exclamation point.

In [None]:
import re
def check_sentence(text):
  result = re.search(r"^[A-Z][a-z ]*[.?!]$", text)
  return result is not None

print(check_sentence("Is this is a sentence?")) # True
print(check_sentence("is this is a sentence?")) # False
print(check_sentence("Hello")) # False
print(check_sentence("1-2-3-GO!")) # False
print(check_sentence("A star is born.")) # True