# Regular Expressions (RegEx)

Regular expressions, or RegEx for short, are a way of matching strings against patterns, from simple literal text to complex rules.


### Strengths of RegEx
1. Its syntax, though complex, lets programmers write robust rules in very little space.

2. It lets the researcher capture many kinds of variation in strings.

3. It can perform remarkably quickly when compared to other methods.

4. It is supported in nearly every programming language.

### Weaknesses of RegEx
1. Its syntax is quite difficult for beginners.

2. In order to work well, it requires a domain expert to work alongside the programmer and think of all the ways a pattern may vary in texts.

# How do we use RegEx in Python?

In [1]:
import re

In [2]:
pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

text = "This is a date 2 February. Another date would be 14 August."
matches = re.findall(pattern, text)
print (matches)

[('2 February', '2', 'February'), ('14 August', '4', 'August')]


### Code Breakdown

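`(\d){1,2}` says that we are looking for a digit (0-9) that occurs either once or twice `{1,2}`. Because the quantifier sits outside the capture group, the group keeps only the last digit it matched, which is why `14 August` yields `'4'` in the output above.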

Next we have `(January|February|March|April|May|June|July|August|September|October|November|December)`, which uses a series of `OR`s (`|`) to say that exactly one of these month names must follow the digits.
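A repeated capture group such as `(\d){1,2}` stores only its final repetition. The following is a minimal sketch of one way to capture the whole day, assuming we are free to move the quantifier inside the group. The variable names `day_outside` and `day_inside` are illustrative only and do not appear elsewhere in this notebook.

```python
import re

# Month alternation, same names as in the pattern above.
months = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

text = "This is a date 2 February. Another date would be 14 August."

# Quantifier outside the group: only the last repeated digit is captured.
day_outside = re.findall(r"(\d){1,2} (" + months + ")", text)

# Quantifier inside the group: the full one- or two-digit day is captured.
day_inside = re.findall(r"(\d{1,2}) (" + months + ")", text)

print(day_outside)
print(day_inside)
```

The first list contains `('4', 'August')`, while the second contains `('14', 'August')`.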

### We can also have bigger regular expressions

The following accounts for dates of both types:

`February 2` or `2 February`

In [13]:
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"

text = "This is a date February 2. Another date would be 14 August."
matches = re.findall(pattern, text)
print (matches)

[('February 2', '', '', '', '', 'February 2', 'February ', 'February', '2'), ('14 August', '14 August', '4', ' August', 'August', '', '', '', '')]
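Every pair of parentheses in the pattern is a capture group, so `re.findall` returns one tuple slot per group, most of them empty. As a sketch of one possible cleanup, assuming we only care about the full date string, the groups can be made non-capturing with `(?:...)`. The pattern below is a simplified stand-in for the one above, not a drop-in replacement for every use.

```python
import re

# Non-capturing group of month names: groups the alternation without
# adding a slot to the findall result.
months = ("(?:January|February|March|April|May|June|July|"
          "August|September|October|November|December)")

# Either "14 August" (day first) or "February 2" (month first).
pattern = r"\d{1,2} " + months + "|" + months + r" \d{1,2}"

text = "This is a date February 2. Another date would be 14 August."

# With no capture groups, findall returns the full matches as plain strings.
matches = re.findall(pattern, text)
print(matches)
```

With no capturing groups left, the result is a flat list of the two date strings.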


### Using an iterator - `re.finditer`

In [8]:
text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
print (iter_matches)
for hit in iter_matches:
    print (hit)

<callable_iterator object at 0x107b1b9d0>
<re.Match object; span=(15, 25), match='February 2'>
<re.Match object; span=(49, 58), match='14 August'>


Each match object carries the start and end indices of the match in its span. We can use these to slice the dates out of the text.

In [9]:
text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
for hit in iter_matches:
    start = hit.start()
    end = hit.end()
    print (text[start:end])

February 2
14 August
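Slicing with `start` and `end` works, but a match object can also return its matched text directly through `group()`. The sketch below collects both the text and the span of each hit. It assumes a simplified, non-capturing version of the date pattern, and the name `dates` is illustrative.

```python
import re

# Simplified stand-in for the date pattern used above.
months = ("(?:January|February|March|April|May|June|July|"
          "August|September|October|November|December)")
pattern = re.compile(r"\d{1,2} " + months + "|" + months + r" \d{1,2}")

text = "This is a date February 2. Another date would be 14 August."

# group() with no arguments returns the whole match; span() returns (start, end).
dates = [(hit.group(), hit.span()) for hit in pattern.finditer(text)]

for matched_text, (start, end) in dates:
    # The slice and group() give the same string.
    print(matched_text, start, end, text[start:end] == matched_text)
```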


# Using RegEx with the EntityRuler

### How we used to do it

In [14]:
#Import the requisite library
import spacy

#- Sample text
text = "This is a sample number 555-5555."

#- Start from a blank English pipeline
nlp = spacy.blank("en")

#- Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#- List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},
                {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]}
            ]
#- Add patterns to ruler
ruler.add_patterns(patterns)

#- Create the doc
doc = nlp(text)

#- Extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

555-5555 PHONE_NUMBER


### Now we use RegEx

In [22]:

#- Sample text
text = "This is a sample number (555) 555-5555."

#- Start from a blank English pipeline
nlp = spacy.blank("en")

#- Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#- List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", 
                    "pattern": [
                                {"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}
                               ]
                }
            ]

#- Add patterns to ruler
ruler.add_patterns(patterns)


#- Create the doc
doc = nlp(text)

#- Extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

### EntityRuler cannot use RegEx to pattern match across tokens!

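Nothing is printed. The `REGEX` operator is applied to one token at a time, and the tokenizer splits the phone number at the dash, so no single token contains the full `555-5555` sequence for the pattern to match.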
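The difference between matching per token and matching the whole text can be sketched without spaCy. The token list below is hard-coded and is only an assumption about how the tokenizer splits this sentence fragment. It is not produced by spaCy in this example.

```python
import re

phone = re.compile(r"\d{3}-\d{4}")

text = "This is a sample number (555) 555-5555."

# ASSUMPTION: an illustrative, hand-written tokenization of the phone number.
tokens = ["(", "555", ")", "555", "-", "5555"]

# Token by token, no single token contains digits, a dash, and more digits.
token_hits = [tok for tok in tokens if phone.search(tok)]

# Against the raw text, the pattern matches and reports character offsets.
text_hits = [(m.group(), m.span()) for m in phone.finditer(text)]

print(token_hits)
print(text_hits)
```

Running the pattern over the raw text gives character offsets, which can then be mapped back onto tokens. spaCy's `Doc.char_span` is one way to do that mapping, though that step is not shown here.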