
# Regex

Regular expressions (regex) describe patterns to match in strings. They are written in a condensed formatting language.

Basically, a regex is a pattern that you give to a regex processor along with some source data. The processor then parses the source data using that pattern and returns chunks of text to the data scientist or programmer for further manipulation.

In [None]:
import re

text = 'It\'s my job'

# search() checks for a match anywhere in the string. It returns a Match object (truthy) if found, or None (falsy) otherwise
if re.search("job", text):
  print('IT\'S OUR JOB')
else:
  print('It\'s time to get a job')

IT'S OUR JOB
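
The `if` test above works because `re.search()` returns an `re.Match` object when the pattern is found and `None` otherwise. A minimal sketch of inspecting that return value directly (the variable names `match` and `no_match`, and the word "career", are just illustrative):

```python
import re

text = "It's my job"

# A successful search returns a Match object, which is truthy.
match = re.search("job", text)
print(match.group())  # the text that was matched
print(match.span())   # (start, end) positions of the match in the string

# A failed search returns None, which is falsy.
no_match = re.search("career", text)
print(no_match)
```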


In [None]:
# The findall() and split() functions will parse the string for us and return chunks.
text = "Cristiano is the greatest of all time. Cristiano is a generational talent. Cristiano's influence is unreal."
print(re.findall("Cristiano", text))
print()
re.split("Cristiano", text)

['Cristiano', 'Cristiano', 'Cristiano']



['',
 ' is the greatest of all time. ',
 ' is a generational talent. ',
 "'s influence is unreal."]

# Complex Patterns
Regex syntax defines a compact language for describing patterns in text.

## Anchors
They specify the start and/or the end of the string that you are trying to match. The caret character ^ means start and the dollar sign $ means end.

In [None]:
text = "Messi is finished. The GOAT is Ronaldo"

# Let's see if the statement above starts with 'Messi'.
# re.search() returns an re.Match object holding the matched text and its location, or None if nothing matched.
# A Match object is truthy and None is falsy, so the result can be used directly in an if statement.
if re.search("^Messi", text):
  print("Pessi plays soccer")
else:
  print("Ronaldo plays in the camel league")

# Checking if it ends with 'Ronaldo'
re.search("Ronaldo$", text)

Pessi plays soccer


<re.Match object; span=(31, 38), match='Ronaldo'>
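
To see that anchors really pin the match to one end of the string, here is a small sketch that reuses the sentence above and also tries an anchor on the wrong end. The variable names are illustrative only.

```python
import re

text = "Messi is finished. The GOAT is Ronaldo"

# ^ pins the pattern to the start of the string.
at_start = re.search("^Messi", text)

# $ pins the pattern to the end of the string.
at_end = re.search("Ronaldo$", text)

# 'Ronaldo' appears in the text, but not at the start, so this finds nothing.
wrong_anchor = re.search("^Ronaldo", text)

print(at_start.span())
print(at_end.span())
print(wrong_anchor)
```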

## Patterns and Character Classes

In [None]:
grades = "AABCAAAACBBDFF"

# If we want to find how many As and Bs there are in the string above, we put the characters A and B inside square brackets
re.findall("[AB]", grades)


['A', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'B']

In [None]:
# This is called the set operator. It's also possible to include a range of characters,
# which are ordered alphanumerically. For instance, if we want to refer to all lowercase letters
# we could use [a-z].

# This simple regex parses out all instances where a student receives an A followed by a B or C.
re.findall("[A][B-C]", grades)

['AB', 'AC']
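
The `[a-z]` range mentioned in the comment above is not exercised by the grades example, so here is a short sketch of ranges on a made-up string (`sample` is a hypothetical value, not part of the original data):

```python
import re

sample = "Grade A for user42"  # hypothetical sample string

# [a-z] matches any single lowercase letter.
lowercase = re.findall("[a-z]", sample)

# [0-9] matches any single digit.
digits = re.findall("[0-9]", sample)

print(lowercase)
print(digits)
```

Uppercase letters such as the `G` and `A` are skipped because they fall outside the `a-z` range.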

In [None]:
# In addition, the pipe operator can be used to do the same thing.
re.findall("AB|AC", grades)

['AB', 'AC']

In [None]:
# Inside square brackets, a leading caret ^ negates the set. So, to parse out only the grades that are not A's, we can write:
re.findall("[^A]", grades)

['B', 'C', 'C', 'B', 'B', 'D', 'F', 'F']

In [None]:
# This returns an empty list because the pattern only matches a character at the beginning of the string that is not an A.
# But, as we can see, the string grades starts with an A, so no match is found.
print(grades)
re.findall("^[^A]", grades)

AABCAAAACBBDFF


[]

# Quantifiers
Quantifiers specify how many times a pattern must repeat for a match to occur.

The most basic quantifier is written as e{m,n}, where e is the expression or character to be matched, m is the minimum number of times it must be matched, and n is the maximum number of times it can be matched.

In [None]:
# Example: how many streaks of back-to-back A's are there in the grades string?
re.findall("A{2,10}", grades) # a minimum of 2 occurrences and a maximum of 10

['AA', 'AAAA']

In [None]:
# Now let's see a different approach. The previous pattern matches any streak of two to ten A's in a row,
# so it treats four A's as a single match. Below, we search only for streaks of exactly two A's in a row to see how many times they occur.
re.findall("A{1,1}A{1,1}", grades)

['AA', 'AA', 'AA']

In [None]:
# And if you only pass a number inside the curly brackets, it is considered both m and n
re.findall("B{2}", grades)

['BB']
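
The single-number form also gives a shorter way to write the earlier `A{1,1}A{1,1}` pattern. The sketch below checks that three spellings of "exactly two A's" return the same result on the grades string. The variable names are illustrative.

```python
import re

grades = "AABCAAAACBBDFF"

explicit = re.findall("A{1,1}A{1,1}", grades)  # one A followed by one A
shorthand = re.findall("A{2}", grades)         # exactly two A's
literal = re.findall("AA", grades)             # the same thing written out

print(explicit == shorthand == literal)
print(shorthand)
```

Because `findall()` returns non-overlapping matches, the run of four A's is counted as two separate pairs.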

In [None]:
# Another interesting pattern is the decreasing trend in the grades
re.findall("A{1,1}B{1,1}C{1,1}", grades)

['ABC']

There are three other quantifiers that are used as shorthand. An asterisk * matches zero or more times.

A question mark ? matches zero or one time.

And a plus sign + matches one or more times.
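
A small sketch of how the three shorthand quantifiers differ, using a made-up sample string (`sample` is hypothetical and not part of the grades data):

```python
import re

sample = "A AB ABB ABBB"  # hypothetical sample string

# * : an A followed by zero or more B's
star = re.findall("AB*", sample)

# ? : an A followed by zero or one B
optional = re.findall("AB?", sample)

# + : an A followed by one or more B's
plus = re.findall("AB+", sample)

print(star)
print(optional)
print(plus)
```

Note that `?` never consumes more than one B, while `+` skips the lone `A` because it requires at least one B.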

# Wikipedia
Let's look at a more complex example.

In [None]:
with open("dataset/ferpa.txt", "r") as file:
  wiki = file.read()


In [None]:
import re

text = "https://google.com is better than http://www.baidu.com, ok?"
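# [https] is a character class that matches a single character, not the word "https",
# so the lookbehind is anchored on the literal "://" instead. A raw string avoids invalid-escape warnings.
re.findall(r"(?<=://)([A-Za-z0-9.]*)", text)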

['google.com', 'www.baidu.com']

In [None]:
text = r'''This text has:
(a) One line
(b) Two lines
(c) Three lines'''

# Count the list markers such as (a): an escaped "(", any single character, an escaped ")"
len(re.findall(r"\(.\)", text))

3