In [None]:
# Import module
import re

## **Introduction**

* Regular expressions are a powerful language for matching text patterns.

* The Python "re" module provides regular expression support.

# **Simple patterns**

## **Metacharacters**

* Most letters and characters will simply match themselves. *Metacharacters* do NOT match themselves.

* Examples:

  1. "**.**" = Matches any character except a newline.

  2. "**$**" = Matches the end of the string or just before the newline at the end of the string.

  3. "**^**" = Matches the start of the string.

  4. "**{m}**" = Specifies that exactly *m* copies of the previous RE should be matched.

  5. "**[...]**" = Used to indicate a set of characters.

  6. "**\\**" = Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence.

  7. "**\|**" = A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. When one pattern completely matches, that branch is accepted and no further patterns are tested.

  8. "**(...)**" = Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.

* The following *metacharacters* are repeating qualifiers:

  1. "**\***" = Causes the resulting RE to match 0 or more repetitions of the preceding RE. It will match as much text as possible (i.e. greedy).

  2. "**+**" = Causes the resulting RE to match 1 or more repetitions of the preceding RE. It will match as much text as possible (i.e. greedy).

  3. "**?**" = Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. It will match as much text as possible (i.e. greedy).

  4. "**{m,n}**" = Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible (i.e. greedy). Write it without a space after the comma; "{m, n}" is treated as literal text.
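
The metacharacters and repeating qualifiers listed above can be tried out with a minimal sketch like the following. The sample strings and variable names are made up for illustration, and *re.findall()* and *re.search()* are used only as a convenient way to show what each pattern matches.

```python
import re

# "." matches any single character except a newline
dot_matches = re.findall(r"h.t", "hat hot h\nt")

# "^" and "$" anchor the pattern to the start and end of the string
anchored_hit = re.search(r"^cat$", "cat")
anchored_miss = re.search(r"^cat$", "concat")

# "[...]" with "{m}": exactly three characters from the set 0-9
three_digits = re.findall(r"[0-9]{3}", "12 345 6789")

# "{m,n}" (no space after the comma): two to three repetitions, greedy
a_runs = re.findall(r"a{2,3}", "a aa aaa aaaa")

# "?" makes the preceding "u" optional; "|" chooses between alternatives
spellings = re.findall(r"colou?r", "color colour")
animals = re.findall(r"cat|dog", "cat, dog, bird")

print(dot_matches)                 # ['hat', 'hot']
print(anchored_hit is not None)    # True
print(anchored_miss is not None)   # False
print(three_digits)                # ['345', '678']
print(a_runs)                      # ['aa', 'aaa', 'aaa']
print(spellings)                   # ['color', 'colour']
print(animals)                     # ['cat', 'dog']
```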

## **Special sequences**

* "\\" can signal a special sequence.

* Some of the special sequences represent predefined sets of characters that are often useful.

* These include:

  1. **\d** = Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].

  2. **\D** = Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].

  3. **\s** = Matches any whitespace character.

  4. **\S** = Matches any non-whitespace character.

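  5. **\w** = Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_]. With ordinary (Unicode) string patterns it also matches letters and digits from other scripts, unless the re.ASCII flag is used.

  6. **\W** = Matches any character that **\w** does not match; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].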
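
A short sketch of the special sequences above, using made-up sample strings. Each call uses *re.findall()* so the matched pieces are easy to inspect.

```python
import re

# \d and \D: digits and non-digits
digits = re.findall(r"\d+", "Order 66 shipped on 2024-01-15")
non_digits = re.findall(r"\D+", "a1b22c")

# \w and \W: word characters (letters, digits, underscore) and everything else
words = re.findall(r"\w+", "snake_case and CamelCase9!")
non_words = re.findall(r"\W", "hi, there!")

# \s and \S: whitespace (spaces, tabs, newlines) and non-whitespace
gaps = re.findall(r"\s+", "a  b\tc\nd")
tokens = re.findall(r"\S+", "a  b\tc\nd")

print(digits)      # ['66', '2024', '01', '15']
print(non_digits)  # ['a', 'b', 'c']
print(words)       # ['snake_case', 'and', 'CamelCase9']
print(non_words)   # [',', ' ', '!']
print(gaps)        # ['  ', '\t', '\n']
print(tokens)      # ['a', 'b', 'c', 'd']
```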


# **Using regular expressions**

## **Pattern objects**

* Regular expressions are compiled into pattern objects.

### **Compiling regular expressions**

* Use [*re.compile()*](https://docs.python.org/3/library/re.html#re.compile) to create a pattern object.

* This function also accepts an optional flags argument, used to enable various special features and syntax variations.

In [None]:
### re.compile(pattern, flags=0)
pattern = 'ab*'
patternObj = re.compile(pattern)
print(type(patternObj))
print(patternObj)

<class 're.Pattern'>
re.compile('ab*')


### **Dealing with backslashes**

**The problem:**

* Regular expressions use the backslash character ('\\') to indicate special forms or to allow special characters to be used without invoking their special meaning.

* This conflicts with Python’s usage of the same character for the same purpose in string literals.

**The solution:**

* Use Python’s raw string notation for regular expressions.

* Backslashes are not handled in any special way in a string literal prefixed with 'r'.

In [None]:
### re.compile(pattern, flags=0)
pattern = r'ab*'
patternObj = re.compile(pattern)
print(type(patternObj))
print(patternObj)

<class 're.Pattern'>
re.compile('ab*')
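
The pattern 'ab*' contains no backslashes, so the raw prefix makes no visible difference there. As a sketch of where it does matter, the example below matches a literal backslash; the sample text is made up for illustration.

```python
import re

# The text to search contains one literal backslash
text = r"\section{Intro}"

# Without a raw string, each backslash must be escaped twice:
# once for the Python string literal and once for the regex engine
plain_pattern = "\\\\section"

# With a raw string, only the regex-level escaping is needed
raw_pattern = r"\\section"

# Both spellings produce the same pattern string
same_pattern = plain_pattern == raw_pattern
matched_text = re.search(raw_pattern, text).group()

print(same_pattern)               # True
print(matched_text)               # \section
print(len("\n"), len(r"\n"))      # 1 2  (newline character vs. backslash + 'n')
```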


### **Performing matches**

* Pattern objects have methods for various operations.

* These include:

  1. *match()* = Determine if the RE matches at the beginning of the string.

  2. *search()* = Scan through a string, looking for any location where this RE matches.

  3. *findall()* = Find all substrings where the RE matches, and return them as a list.

  4. *finditer()* = Find all substrings where the RE matches, and return them as an iterator of match objects.

#### **Match()**

* *match()* determines if the RE matches *at the beginning of the string*.

* If no match is found, it returns *None*.

* If a match is found, a match object is returned.

* Match objects contain information about the match and can be queried using the following methods:

  1. *group()* = Return the string matched by the RE

  2. *start()* = Return the starting position of the match

    **Note:** Since *match()* only checks if the RE matches at the start of a string, this will always be zero.

  3. *end()* = Return the ending position of the match

  4. *span()* = Return a tuple containing the (start, end) positions of the match

In [None]:
### Example 1 - no match can be found using match()

string = ""

# Create pattern object
patternObject = re.compile('[a-z]+')

# Match string
matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

<class 'NoneType'>
None


In [None]:
### Example 2 - a match can be found using match()

string = "house"

# Create pattern object
patternObject = re.compile('[a-z]+')

# Match string
matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

<class 're.Match'>
<re.Match object; span=(0, 5), match='house'>


In [None]:
### Example 2 - query the match object
# Note: Store the match object in a variable, and then check if it's None

# Return the string matched by the RE using group()
if matchResult:
    print(matchResult.group())
else:
    print("No match")

# Return the starting position of the match using start()
if matchResult:
    print(matchResult.start())
else:
    print("No match")

# Return the ending position of the match using end()
if matchResult:
    print(matchResult.end())
else:
    print("No match")

# Return a tuple containing the (start, end) positions of the match using span()
if matchResult:
    print(matchResult.span())
else:
    print("No match")

house
0
5
(0, 5)


#### **Search()**

* *search()* scans through a string, looking for *any location where the RE matches*.

* If no match is found, it returns *None*.

* If a match is found, a match object is again returned.

* Match objects contain information about the match and can be queried using the following methods:

  1. *group()* = Return the string matched by the RE

  2. *start()* = 	Return the starting position of the match

  3. *end()* = Return the ending position of the match

  4. *span()* = Return a tuple containing the (start, end) positions of the match

In [None]:
### Example 1 - no match can be found using search()

string = ""

# Create pattern object
patternObject = re.compile('[a-z]+')

# Search string
searchResult = patternObject.search(string)

# Print match object
print(type(searchResult))
print(searchResult)

<class 'NoneType'>
None


In [None]:
### Example 2 - a match can be found using search()

string = "I live in a house"

# Create pattern object
patternObject = re.compile('house')

# Search string
searchResult = patternObject.search(string)

# Print match object
print(type(searchResult))
print(searchResult)

# Return the string matched by the RE using group()
if searchResult:
    print(searchResult.group())
else:
    print("No match")

# Return the starting position of the match using start()
if searchResult:
    print(searchResult.start())
else:
    print("No match")

# Return the ending position of the match using end()
if searchResult:
    print(searchResult.end())
else:
    print("No match")

# Return a tuple containing the (start, end) positions of the match using span()
if searchResult:
    print(searchResult.span())
else:
    print("No match")

<class 're.Match'>
<re.Match object; span=(12, 17), match='house'>
house
12
17
(12, 17)
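
The difference between *match()* and *search()* is easiest to see when both are applied to the same string. The sketch below does that; `describe` is a hypothetical helper written for this example, not part of the *re* module.

```python
import re

def describe(match_object):
    """Hypothetical helper: summarise a match object, or report that there is none."""
    if match_object is None:
        return "No match"
    return f"{match_object.group()!r} at {match_object.span()}"

pattern = re.compile(r"house")
text = "I live in a house"

# match() only tries position 0, where the text starts with "I", so it fails
match_summary = describe(pattern.match(text))

# search() scans forward and finds "house" later in the string
search_summary = describe(pattern.search(text))

print(match_summary)   # No match
print(search_summary)  # 'house' at (12, 17)
```

Checking the result against *None* before calling *group()* avoids an AttributeError when nothing matches.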


#### **Findall()**

* *findall()* scans through a string and finds all substrings where the RE matches.

* If no match is found, it returns an empty list.

* If matches are found, it returns the substrings as a list.

In [None]:
### Example 1 - no matches can be found using findall()

string = ""

# Create pattern object
patternObject = re.compile('[a-z]+')

# Search string
findallResult = patternObject.findall(string)

# Print list
print(type(findallResult))
print(findallResult)

<class 'list'>
[]


In [None]:
### Example 2 - matches can be found using findall()

string = "I live in a house near the other house"

# Create pattern object
patternObject = re.compile('house')

# Search string
findallResult = patternObject.findall(string)

# Print list
print(type(findallResult))
print(findallResult[1])

<class 'list'>
house


In [None]:
### Example 3 - matches can be found using findall()

string = "Task: Ask a question \nSpecification: Give first example \n Task: Ask a question \nSpecification: Give second example"

# Create pattern object
patternObject = re.compile('Task:.*\nSpecification:.*')

# Search string
findallResult = patternObject.findall(string)

# Print list
print(type(findallResult))
print(findallResult)

<class 'list'>
['Task: Ask a question \nSpecification: Give first example ', 'Task: Ask a question \nSpecification: Give second example']


#### **Finditer()**

* *finditer()* scans through a string, looking for every location where the RE matches.

* It returns a sequence of match object instances as an iterator.

* These match objects can also be queried using the following methods:

  1. *group()* = Return the string matched by the RE

  2. *start()* = 	Return the starting position of the match

  3. *end()* = Return the ending position of the match

  4. *span()* = Return a tuple containing the (start, end) positions of the match

In [None]:
### Example 1 - no match can be found using finditer()

string = ""

# Create pattern object
patternObject = re.compile('[a-z]+')

# Search string
searchResult = patternObject.finditer(string)

# Print iterable
print(type(searchResult))
print(searchResult)

# Print match objects
for matchObject in searchResult:
  print(type(matchObject))
  print(matchObject)

<class 'callable_iterator'>
<callable_iterator object at 0x7f0b923a8b10>


In [None]:
### Example 2 - a match can be found using finditer()

string = "I live in a house near the other house"

# Create pattern object
patternObject = re.compile('house')

# Search string
finditerResult = patternObject.finditer(string)

for matchObject in finditerResult:
  print(type(matchObject))
  print(matchObject)

<class 're.Match'>
<re.Match object; span=(12, 17), match='house'>
<class 're.Match'>
<re.Match object; span=(33, 38), match='house'>
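
Because *finditer()* yields match objects, positions can be collected as well as text. The sketch below reuses the sentence from the example above and also shows that the iterator can only be consumed once; the variable names are illustrative.

```python
import re

pattern = re.compile("house")
text = "I live in a house near the other house"

# Collect the (start, end) span of every match
spans = [m.span() for m in pattern.finditer(text)]

# An iterator is exhausted after one pass
iterator = pattern.finditer(text)
first_pass = [m.group() for m in iterator]
second_pass = [m.group() for m in iterator]

print(spans)        # [(12, 17), (33, 38)]
print(first_pass)   # ['house', 'house']
print(second_pass)  # []
```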


## **Grouping**

* "**(...)**" = Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.

* You can repeat the contents of a group with a repeating qualifier, such as **\***, **+**, **?**, or **{m,n}**.

* For example, **(ab)\*** will match zero or more repetitions of ab.

In [None]:
### Use groups

string = "abab"

# Create pattern object
patternObject = re.compile('(ab)*')

# Match string
matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

<class 're.Match'>
<re.Match object; span=(0, 4), match='abab'>


* The [group()](https://docs.python.org/3/library/re.html#re.Match.group) method returns one or more subgroups of the match.

* Groups are numbered starting with 0. Without arguments, *group()* defaults to zero.

  **Note:** Group 0 is always present; it’s the whole RE, and returns the whole match.

In [None]:
### Print the different groups

string = "abcd"

# Create pattern object
patternObject = re.compile('(ab)(cd)')

# Match string
matchResult = patternObject.match(string)

# # Print match object
# print(type(matchResult))
# print(matchResult)

# Print group 0
print("Group 0:", matchResult.group())

# Print group 1
print("Group 1:", matchResult.group(1))

# Print group 2
print("Group 2:", matchResult.group(2))

Group 0: abcd
Group 1: ab
Group 2: cd
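
As a small extension of the example above, *group()* also accepts several group numbers at once, and *groups()* returns all numbered subgroups as a tuple. The sketch below uses the same pattern and string; the variable names are illustrative.

```python
import re

match_result = re.compile(r"(ab)(cd)").match("abcd")

# Several group numbers return a tuple of the corresponding substrings
both = match_result.group(1, 2)

# groups() returns every subgroup from 1 upwards (group 0 is not included)
all_groups = match_result.groups()

# start(), end() and span() also accept a group number
span_of_group_2 = match_result.span(2)

print(both)             # ('ab', 'cd')
print(all_groups)       # ('ab', 'cd')
print(span_of_group_2)  # (2, 4)
```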


In [None]:
### Print the different groups

string = "abababab"

# Create pattern object
patternObject = re.compile('(ab)*')

# Match string
matchResult = patternObject.match(string)

# # Print match object
# print(type(matchResult))
# print(matchResult)

# Print group 0
print("Group 0:", matchResult.group())

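# Print group 1 (for a repeated group, this is the last repetition matched)
print("Group 1:", matchResult.group(1))

# Print group 2 (the pattern defines only one group, so this raises IndexError)
print("Group 2:", matchResult.group(2))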

Group 0: abababab
Group 1: ab


IndexError: no such group

In [None]:
### findall() returns one list entry per match, not one per group

string = "abababab"

# Create pattern object
patternObject = re.compile('(ab)')

# Find all matches (with a single group, each entry is that group's text)
findallResult = patternObject.findall(string)

# # Print result
# print(type(findallResult))
# print(findallResult)

# Print match 0
print("Match 0:", findallResult[0])

# Print match 1
print("Match 1:", findallResult[1])

# Print match 2
print("Match 2:", findallResult[2])

# Print match 3
print("Match 3:", findallResult[3])

Match 0: ab
Match 1: ab
Match 2: ab
Match 3: ab
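
What *findall()* returns depends on how many groups the pattern contains. The sketch below, with made-up patterns, shows the three cases: no group, one group, and several groups.

```python
import re

# No group: each entry is the whole match
no_group = re.findall(r"ab", "abab")

# One group: each entry is the text captured by that group
one_group = re.findall(r"a(b)", "abab")

# Several groups: each entry is a tuple with one item per group
two_groups = re.findall(r"(a)(b)", "abab")

print(no_group)    # ['ab', 'ab']
print(one_group)   # ['b', 'b']
print(two_groups)  # [('a', 'b'), ('a', 'b')]
```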


## **Greedy versus Non-Greedy**

* Repeating qualifiers such as **\***, **+**, **?** and **{m,n}** are greedy.

* When repeating a RE, the matching engine will try to repeat it as many times as possible.

* If later portions of the pattern don’t match, the matching engine will then back up and try again with fewer repetitions.

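* In the example below, **.\*** first consumes the whole string, including the final 'd'. The engine then backs up one character so that the literal **d** in the pattern can match, which is why the final 'd' is excluded from the group.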

In [None]:
### Greedy match using group

string = "my name is billy pitchford"

# Create pattern object
patternObject = re.compile('(.*)d')

# Match string
matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

# Print group
print("\n", "Group:", matchResult.group(1))

<class 're.Match'>
<re.Match object; span=(0, 26), match='my name is billy pitchford'>

 Group: my name is billy pitchfor


* The solution is to place a **?** after the qualifier.

* The non-greedy qualifiers **\*?**, **+?**, **??**, and **{m,n}?** match as little text as possible.

* In the example below, **.+?** must match at least one character and then stops as soon as it can, so only the first 'm' is captured, whereas a greedy **.+** would capture the whole string.

In [None]:
### Non-greedy match using group

string = "my name is billy pitchford"

# Create pattern object
patternObject = re.compile('(.+?)')

# Match string
matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

# Print group
print("\n", "Group:", matchResult.group(1))

<class 're.Match'>
<re.Match object; span=(0, 1), match='m'>

 Group: m
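
A common place where the greedy and non-greedy forms give visibly different results is text with repeated delimiters. The sketch below uses a made-up HTML-like string.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" runs to the last ">" in the string, giving one long match
greedy = re.findall(r"<.*>", html)

# Non-greedy: ".*?" stops at the first ">" it can, giving one match per tag
non_greedy = re.findall(r"<.*?>", html)

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']
```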


## **Compilation flags**

* Compilation flags let you modify how regular expressions work.

* Flags are available in the *re* module under two names, a long name (e.g. IGNORECASE) and a short form (e.g. I).
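
Several flags can be combined with the bitwise OR operator (**|**). The sketch below combines IGNORECASE and MULTILINE, using both the long and the short names; the sample text is made up for illustration.

```python
import re

text = "Name Billy\nSURNAME Pitchford"

# Long names combined with |
combined = re.compile(r"^surname", re.IGNORECASE | re.MULTILINE)

# Short names are aliases for the same flags
short_form = re.compile(r"^surname", re.I | re.M)

# With IGNORECASE alone, "^" only matches at the very start of the string
only_ignorecase = re.compile(r"^surname", re.IGNORECASE)

combined_result = combined.search(text)
short_result = short_form.search(text)
single_result = only_ignorecase.search(text)

print(combined_result.group(), combined_result.span())  # SURNAME (11, 18)
print(short_result.group())                             # SURNAME
print(single_result)                                    # None
```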

### **Ignorecase (I)**

When this flag is specified:

* Perform case-insensitive matching; character class and literal strings will match letters by ignoring case.

In [None]:
### Without Ignorecase

# Create pattern object
patternObject = re.compile('hello')

# Match string
string = "HELLO"
matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

In [None]:
### With Ignorecase

# Create pattern object
patternObject = re.compile('hello', re.IGNORECASE)

# Match string
string = "HELLO"
matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

### **Multiline (M)**

When this flag is specified:

* ^ matches at the beginning of the string and at the beginning of each line within the string (immediately following each newline).

* $ matches at the end of the string and at the end of each line (immediately preceding each newline).

In [None]:
### Without MULTILINE

# Create pattern object
patternObject = re.compile(r"^Surname")

# Match string
string = """Name Billy\nSurname Pitchford"""

matchResult = patternObject.search(string)

# Print match object
print(type(matchResult))
print(matchResult)

In [None]:
### With MULTILINE

# Create pattern object
patternObject = re.compile(r"^Surname", re.MULTILINE)

# Match string
string = """Name Billy\nSurname Pitchford"""

matchResult = patternObject.search(string)

# Print match object
print(type(matchResult))
print(matchResult)

### **Dotall (S)**

When this flag is specified:

* '.' matches any character including a newline (without this flag, it matches anything except a newline).

In [None]:
### Without DOTALL

# Create pattern object
patternObject = re.compile(r".*")

# Match string
string = "Name Billy\nSurname Pitchford"

matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

In [None]:
### With DOTALL

# Create pattern object
patternObject = re.compile(r".*", re.DOTALL)

# Match string
string = "Name Billy\nSurname Pitchford"

matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

### **Verbose (X)**

When this flag is specified:

* whitespace within the RE string is ignored (except when the whitespace is in a character class or preceded by an unescaped backslash)

* you can put comments within a RE that will be ignored by the engine (comments are marked by a '#' that’s neither in a character class nor preceded by an unescaped backslash)

In [None]:
### Without VERBOSE

string = "Jessa is a Python developer, and her salary is 8000"

# Create pattern object
patternObject = re.compile(r"""(^\w{2,}) # match a word of two or more characters at the start
                        .+ # match one or more of any character
                        (\d{4}$) # match 4-digit number at the end """)

# Match string
matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print(matchResult)

In [None]:
### With VERBOSE

string = "Jessa is a Python developer, and her salary is 8000"

# Create pattern object
patternObject = re.compile(r"""(^\w{2,}) # match a word of two or more characters at the start
                        .+ # match one or more of any character
                        (\d{4}$) # match 4-digit number at the end """,
                        re.VERBOSE)

# Match string
matchResult = patternObject.match(string)

# Print match object
print(type(matchResult))
print("\n","Match: ", matchResult.group())
print("\n","Group 1: ", matchResult.group(1))
print("\n","Group 2: ", matchResult.group(2))

## **Search and replace**

* Matches for a pattern can be replaced with a different string.

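* This can be done using the [*sub()*](https://docs.python.org/3/library/re.html#re.Pattern.sub) or [*subn()*](https://docs.python.org/3/library/re.html#re.Pattern.subn) methods for a pattern object. *sub()* returns the new string; *subn()* returns a tuple of the new string and the number of replacements made.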

* These take a replacement value, which can be either a string or a function, and the string to be processed.

  **Note:** By default, they replace all occurrences.


In [None]:
### Search and replace

string = "Question: Is your name Billy? Answer: My name is Billy Pitchford."
replacement = "Anna"

# Create pattern object
patternObject = re.compile('Billy')

# Perform substitution
subResult = patternObject.subn(replacement, string)

# Print the result type (a tuple) and the substituted string
print(type(subResult))
print(subResult[0])
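
The sketch below illustrates the points made above about *sub()* and *subn()*: replacing all occurrences by default, limiting the number of replacements with the *count* argument, and passing a function as the replacement. The variable names are illustrative.

```python
import re

text = "Question: Is your name Billy? Answer: My name is Billy Pitchford."
pattern = re.compile("Billy")

# subn() returns (new_string, number_of_replacements)
new_text, replacements = pattern.subn("Anna", text)

# count=1 limits sub() to the first occurrence
first_only = pattern.sub("Anna", text, count=1)

# A function receives each match object and returns the replacement text
shouted = pattern.sub(lambda m: m.group().upper(), text)

print(new_text)      # Question: Is your name Anna? Answer: My name is Anna Pitchford.
print(replacements)  # 2
print(first_only)    # Question: Is your name Anna? Answer: My name is Billy Pitchford.
print(shouted)       # Question: Is your name BILLY? Answer: My name is BILLY Pitchford.
```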

# **Project output parsing**

## **Extract specification**

In [None]:
string = """1. Plot a scatter plot of production budget against worldwide gross:
{"$schema": "https://vega.github.io/schema/vega-lite/v4.json", "mark": {"type": "point", "tooltip": true}, "encoding": {"x": {"field": "Production Budget", "type": "quantitative", "aggregate": null, "axis": {"format": "s"}}, "y": {"field": "Worldwide Gross", "type": "quantitative", "aggregate": null, "axis": {"format": "s"}}, "tooltip": {"field": "Title"}}, "data": {"url": "https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/movies-w-year.csv", "format": {"type": "csv"}}}
2. Plot a scatter plot of production budget against IMDB rating:
{"$schema": "https://vega.github.io/schema/vega-lite/v4.json", "mark": {"type": "point", "tooltip": true}, "encoding": {"x": {"field": "Production Budget", "type": "quantitative", "aggregate": null, "axis": {"format": "s"}}, "y": {"field": "IMDB Rating", "type": "quantitative", "aggregate": null, "axis": {"format": "s"}}, "tooltip": {"field": "Title"}}, "data": {"url": "https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/movies-w-year.csv", "format": {"type": "csv"}}}
3. Plot a scatter plot of worldwide gross against IMDB rating:
{"$schema": "https://vega.github.io/schema/vega-lite/v4.json", "mark": {"type": "point", "tooltip": true}, "encoding": {"x": {"field": "Production Budget", "type": "quantitative", "aggregate": null, "axis": {"format": "s"}}, "y": {"field": "Worldwide Gross", "type": "quantitative", "aggregate": null, "axis": {"format": "s"}}, "tooltip": {"field": "Title"}}, "data": {"url": "https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/movies-w-year.csv", "format": {"type": "csv"}}}

Any insights appreciated!

A:

It is possible that"""

In [None]:
### Extract generated specification

# Create pattern object
patternObject = re.compile(r'1.*?({.*}).*?2.*3.*', re.DOTALL)

# Search string
searchResult = patternObject.search(string)

# Print groups
# print(searchResult.group())
print(searchResult.span(1))
print(searchResult.group(1))

## **Replace relevant values**

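* The extracted specification is JSON text. To parse it as a Python literal with *ast.literal_eval()*, the JSON literals must first be converted to their Python equivalents:

* true needs to be replaced with True

* null needs to be replaced with None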

In [None]:
### Search and replace true with True

parsedOutput = searchResult.group(1)
replacement = "True"

# Create pattern object
patternObject = re.compile('true')

# Perform substitution
subResult1 = patternObject.subn(replacement, parsedOutput)

# Print the result type (a tuple) and the substituted string
print(type(subResult1))
print(subResult1[0])

In [None]:
### Search and replace null with None

replacement = "None"

# Create pattern object
patternObject = re.compile('null')

# Perform substitution
subResult2 = patternObject.subn(replacement, subResult1[0])

# Print the result type (a tuple) and the substituted string
print(type(subResult2))
print(subResult2[0])
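
The two substitutions above can also be done in one pass by giving *sub()* a function. The sketch below is one possible approach under stated assumptions: the `json_to_python` mapping and the sample string are made up for illustration, "false" is added although it does not occur in the specification above, and **\b** (a word boundary, not covered earlier in this notebook) is used so that words such as "nullable" are left alone. It would still rewrite a quoted string value that is exactly "true" or "null".

```python
import re

# Hypothetical mapping from JSON literals to Python literals
json_to_python = {"true": "True", "false": "False", "null": "None"}

# \b restricts the match to whole words
literal_pattern = re.compile(r"\b(true|false|null)\b")

sample = '{"tooltip": true, "aggregate": null, "title": "nullable"}'

# The function looks up each matched literal in the mapping
converted = literal_pattern.sub(lambda m: json_to_python[m.group()], sample)

print(converted)  # {"tooltip": True, "aggregate": None, "title": "nullable"}
```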

## **Parse json string into a dictionary**

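* The JSON string needs to be converted to a dictionary so that it can be visualised downstream.

* Convert JSON strings to dictionaries using [json.loads()](https://docs.python.org/3/library/json.html) or *literal_eval()* from the [ast](https://docs.python.org/3/library/ast.html) module.

  **Note:** json.loads() fails on the string produced above because *True* and *None* are not valid JSON. It expects the original *true* and *null*, so it should be applied to the extracted string before the replacements (searchResult.group(1)), whereas ast.literal_eval() needs the replaced version.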

In [None]:
### Print output type
print(type(subResult2[0]))

In [None]:
### Convert json string into dictionary

# Using ast module
# Source: https://docs.python.org/3/library/ast.html
import ast
spec = ast.literal_eval(subResult2[0])
spec

# # Using json module
# Source: https://docs.python.org/3/library/json.html
# import json
# # spec = json.loads(subResult2[0])
# # spec
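
As a sketch of the *json* route, the example below parses a small made-up JSON string directly, with no replacements, and shows that the Python-style spelling is rejected. The sample strings and variable names are assumptions for illustration, not the specification extracted above.

```python
import json

# Valid JSON: lowercase true and null
sample = '{"mark": {"type": "point", "tooltip": true}, "aggregate": null}'
parsed = json.loads(sample)

# json.loads() converts true to True and null to None
tooltip_value = parsed["mark"]["tooltip"]
aggregate_value = parsed["aggregate"]

# After replacing true with True, the text is no longer valid JSON
try:
    json.loads('{"tooltip": True}')
    python_style_parsed = True
except json.JSONDecodeError:
    python_style_parsed = False

print(type(parsed))         # <class 'dict'>
print(tooltip_value)        # True
print(aggregate_value)      # None
print(python_style_parsed)  # False
```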

In [None]:
### Render the vega-lite specifications using the Altair HTML renderer and IPython display function
# Source: https://stackoverflow.com/questions/64393567/how-to-render-vega-lite-viz-in-google-colab
import altair as alt
from IPython.display import display

display(alt.display.html_renderer(spec), raw=True)

# **References**

* https://docs.python.org/3/library/re.html
* https://docs.python.org/3/howto/regex.html#regex-howto
* https://developers.google.com/edu/python/regular-expressions