# 0. Python RegEx modules

In [1]:
import re

Python has an **re** module that supports regular expressions for strings.

**Note:** Be careful with Python **raw strings**.
Python has escape characters built into string evaluation, such as the newline character.
* For example, ```\n``` and ```\t``` are escape characters

In [2]:
print("hello\nthere!") # \n for new line

hello
there!


In [3]:
print("hello\tthere!") # \t for tab

hello	there!


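However, sometimes we want backslashes to be kept literally. Prefixing a string literal with ```r```, as in **```r"string"```**, makes it a **raw string** in which escape sequences are not interpreted. Raw strings are particularly useful for RegEx patterns, which use backslashes heavily.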

In [4]:
print(r"hello\nthere!") # the r prefix makes this a raw string, so \n is printed literally

hello\nthere!
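
To see why this matters for regular expressions, the short sketch below compares a normal and a raw string literal, then uses both as patterns. The variable names and the sample text are illustrative only.

```python
import re

normal = "\n"    # one character: a newline
raw = r"\n"      # two characters: a backslash followed by 'n'
print(len(normal), len(raw))

# "\b" in a normal string is the backspace character, not the regex
# word-boundary anchor, so the two patterns below behave differently.
text = "a cat sat"
no_raw = re.search("\bcat\b", text)      # looks for backspace + cat + backspace
with_raw = re.search(r"\bcat\b", text)   # looks for the whole word cat
print(no_raw, with_raw)
```

Only the raw-string pattern finds the word, because the other pattern contains two literal backspace characters.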


## 0.1. Python match objects

The **re** module has methods that attempt to match a pattern to a string - if they **find a match**, they will 
return a **Match object**, and **if they don't** they will return **None**

* 
```python
    re.search(<pattern>, <string>)
```
    * Returns a Match object representing the first occurrence of ```<pattern>``` in ```<string>```
    
* 
```python
    re.fullmatch(<pattern>, <string>)
```
    * Returns a Match object only if ```<pattern>``` matches the **entire** ```<string>```
    
* 
```python
    re.match(<pattern>, <string>)
```
    * Returns a Match object if ```<string>``` **starts with** a substring that matches ```<pattern>```

The returned Match object has attributes and methods for retrieving information about the search and its result:

* ```.span()``` returns a **tuple** containing the **start and end positions** of the match
* ```.string``` returns the **string** passed into the function
* ```.group()``` returns the **part of the string** where there was a **match**
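
As a minimal sketch of how the three functions differ and what a Match object exposes, consider the following. The sample string is made up for illustration.

```python
import re

text = "banana split"

found = re.search(r"an", text)          # first occurrence anywhere
at_start = re.match(r"an", text)        # must match at position 0
whole = re.fullmatch(r"banana", text)   # must cover the entire string

print(found.span())     # start and end index of the match
print(found.group())    # the matched text
print(found.string)     # the original input string
print(at_start, whole)  # both are None for this input
```

Because failed matches return ```None```, it is usually worth checking the result before calling ```.group()``` or ```.span()``` on it.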

## 0.2. Other re functions
These are functions from the **re** module that don't return a Match object

* 
```python
    re.findall(<pattern>, <string>)
```
    * Returns a list of all non-overlapping substrings of ```<string>``` that match ```<pattern>```, read from left to right
    
* 
```python
    re.sub(<pattern>, <repl>, <string>)
```
    * Returns a copy of ```<string>``` with all instances of ```<pattern>``` replaced with ```<repl>```

# 1. General RegEx Patterns

## 1.0. Matching exact strings & Special characters

The following are **special characters** in regular expressions: 
```python
    \ ( ) [ ] { } + * ? | $ ^ .
```

* To match a string without special characters, just use the string itself
* To match a string containing a **special character**, use a **backslash** for **escaping**: e.g. ```\(``` to match ```(```
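
A small sketch of both approaches to escaping follows: writing the backslashes by hand, and letting ```re.escape``` from the standard library build the escaped pattern. The sample text is an assumption made for illustration.

```python
import re

price = "cost (approx.) $5.00"

# Escaping by hand: each special character gets a backslash.
manual = re.search(r"\(approx\.\)", price)

# re.escape builds an escaped pattern from a literal string.
literal = "$5.00"
auto = re.search(re.escape(literal), price)

print(manual.group())
print(auto.group())
```

```re.escape``` is convenient when the text to match comes from a variable rather than being typed into the pattern.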

## 1.1. The dot
The dot ```.``` matches **any single character** that is not a newline:

```.a.a.a``` <=> ```banana```

Each ```.``` must consume exactly one character, so this pattern cannot match a string shorter than six characters, such as ```a```.

In [5]:
pattern = '.a.a.a'

str1 = '!baaaa' # should not be a match
match1 = re.match(pattern, str1)
print(f"match str1 {match1}")

str2 = 'iakakadsadasdsadsaferwqa' # Should be a match
match2 = re.match(pattern, str2)
print(f"match str2 {match2}")

str3 = 'iiakakadsadasdsadsaferwqa' # Should not be a match
match3 = re.match(pattern, str3)
print(f"match str3 {match3}")

match str1 None
match str2 <re.Match object; span=(0, 6), match='iakaka'>
match str3 None


## 1.2. Character classes
Character classes match **any** of a **set of characters** - **one instance of character class** will match **exactly one** character

```[ab]c[ab]c``` <=> ```bcac```

* ```[a-z]``` means the corresponding character can be **any** character from a to z (inclusive)
* ```[0-9]``` means the corresponding character can be **any** digit from 0 to 9 (inclusive)

### 1.2.a. Notes: Shorthands for common character classes

| Shorthand | Explanation |
| --- | --- |
| ```.``` | Matches **ANY** non-newline character |
| ```[^...]``` | Matches **ANY** character except those listed after ```^``` |
| ```\d``` | Matches **digits**; equivalent to ```[0-9]``` for ASCII text |
| ```\D``` | Matches any **NON-digit**, the opposite of ```\d```; the shorthands below also have opposites |
| ```\w``` | Matches **ANY word character**; equivalent to ```[A-Za-z0-9_]``` for ASCII text (*opposite: ```\W```*) |
| ```\s``` | Matches **whitespace characters** (spaces, tabs, line breaks) (*opposite: ```\S```*) |
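
The shorthands in the table can be tried out quickly with ```re.findall```. This is a minimal sketch; the sample string is invented for illustration.

```python
import re

sample = "Room 42, floor_3\tok"   # \t here is a real tab character

digits = re.findall(r"\d", sample)    # every single digit
words = re.findall(r"\w+", sample)    # runs of word characters
spaces = re.findall(r"\s", sample)    # each whitespace character

print(digits)
print(words)
print(spaces)
```

Note that the underscore counts as a word character, so ```floor_3``` stays in one piece.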

### 1.2.b. Examples
#### Example 1
```"[^a-h]ef"```: the last two characters must be ```"ef"```, and the first character can be anything outside the range ```[a-h]```.

In [6]:
pattern = "[^a-h]ef"
str1 = "9ef" # should match, as 9 is outside [a-h]
match1 = re.match(pattern, str1)
print(f"match str1 {match1}")

str2 = "hef" # should not match, as h is within [a-h]
match2 = re.match(pattern, str2)
print(f"match str2 {match2}")

str3 = "kef" # should match, as k is outside [a-h]
match3 = re.match(pattern, str3)
print(f"match str3 {match3}")

match str1 <re.Match object; span=(0, 3), match='9ef'>
match str2 None
match str3 <re.Match object; span=(0, 3), match='kef'>


#### Example 2
```\(\d\d\d\) \d\d\d-\d\d\d\d``` should only match a phone number of the form (NNN) NNN-NNNN

In [7]:
pattern = r"\(\d\d\d\) \d\d\d-\d\d\d\d" # raw string, so the backslashes reach the regex engine unchanged
str1 = "(615) 839-7368" # should match, as str1 has exactly the expected format
match1 = re.match(pattern, str1)
print(f"match str1 {match1}")

str2 = "(615)-839-7368" # should not match, as str2 has a dash where a space is expected
match2 = re.match(pattern, str2)
print(f"match str2 {match2}")

match str1 <re.Match object; span=(0, 14), match='(615) 839-7368'>
match str2 None


## 1.3. Quantifiers
Quantifiers allow us to specify multiple occurrences of the **same character or character class**

| Quantifiers | Explanation |
| --- | --- |
| ```a*``` | **zero or more** occurrences of ```a``` |
| ```a+``` | **one or more** occurrences of ```a``` |
| ```a?``` | **zero or one** occurrence of ```a``` |
| ```a{2}``` | **Exactly 2** occurrences of ```a``` |
| ```a{2,6}``` | **2 to 6** occurrences (including 2 and 6) of ```a``` (no space after the comma) |
| ```a{2,}``` | **At least 2** occurrences of ```a``` |
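
The brace quantifiers are sensitive to whitespace. The sketch below, which assumes the behavior of CPython's ```re``` module, checks a range quantifier against a few strings and then shows what happens when a space is placed after the comma.

```python
import re

# {2,4} means between 2 and 4 repetitions (no space after the comma).
results = {}
for candidate in ["a", "aa", "aaaa", "aaaaa"]:
    results[candidate] = bool(re.fullmatch(r"a{2,4}", candidate))
    print(candidate, results[candidate])

# With a space, the braces are no longer treated as a quantifier:
# the pattern then matches the literal text "a{2, 4}".
spaced = re.fullmatch(r"a{2, 4}", "aaa")
literal = re.fullmatch(r"a{2, 4}", "a{2, 4}")
print(spaced, literal)
```

No error is raised in the spaced case, so a stray space can silently turn a quantifier into literal text.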

### 1.3.a. Examples
#### Example 1
```a*b``` asks for zero or more occurrences of ```a``` followed by ```b```. The ```a``` may be absent, but no other character may appear in its place. (The code below uses the uppercase version, ```A*B```.)

In [8]:
pattern = "A*B"
str1 = "B" # should match: zero occurrences of A are allowed
match1 = re.match(pattern, str1)
print(f"match str1 {match1}")

str2 = "CB" # should not match: C appears where only A or B is allowed
match2 = re.match(pattern, str2)
print(f"match str2 {match2}")

str3 = "AAAAB" # should match: several occurrences of A precede B
match3 = re.match(pattern, str3)
print(f"match str3 {match3}")

match str1 <re.Match object; span=(0, 1), match='B'>
match str2 None
match str3 <re.Match object; span=(0, 5), match='AAAAB'>


#### Example 2
```a+b``` requires ```a``` to appear at least once before ```b```. Unlike the previous case, a bare ```b``` is no longer accepted. (The code below uses ```a+B```.)

In [9]:
pattern = "a+B"
str1 = "B" # should not match: a has to appear at least once
match1 = re.match(pattern, str1)
print(f"match str1 {match1}")

str2 = "AaaaB" # should not match: the string starts with A, which is not a
match2 = re.match(pattern, str2)
print(f"match str2 {match2}")

str3 = "aaaB" # should match: a appears three times before B
match3 = re.match(pattern, str3)
print(f"match str3 {match3}")

match str1 None
match str2 None
match str3 <re.Match object; span=(0, 4), match='aaaB'>


#### Example 3
```\(\d{3}\) \d{3}-\d{4}``` expects to see the pattern: ```(3digits) 3digits-4digits```

In [10]:
pattern = r"\(\d{3}\) \d{3}-\d{4}" # raw string, so the backslashes reach the regex engine unchanged
str1 = "(615) 839-7368" # should match, as str1 has exactly the expected format
match1 = re.match(pattern, str1)
print(f"match str1 {match1}")

str2 = "(615)-839-7368" # should not match, as str2 has a dash where a space is expected
match2 = re.match(pattern, str2)
print(f"match str2 {match2}")

match str1 <re.Match object; span=(0, 14), match='(615) 839-7368'>
match str2 None


## 1.4. Combining Patterns
The pipe ```|``` operator matches either the expression on its left or the one on its right. A single match uses one alternative, **not both**.

In [11]:
pattern = r"\d+|Inf"
str1 = "798" # should match the left-hand side (LHS)
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 <re.Match object; span=(0, 3), match='798'>


In [12]:
pattern = r"\d+|Inf"
str1 = "Inf" # should match the right-hand side (RHS)
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 <re.Match object; span=(0, 3), match='Inf'>


In [13]:
pattern = r"\d+|Inf"
str1 = "798Inf" # should not match: neither alternative covers the entire string
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 None


#### 1.4.1. Using parentheses to group expressions

In [14]:
pattern = "(>3)+"
str1 = ">3>3>3" # should match: the group >3 is repeated three times
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 <re.Match object; span=(0, 6), match='>3>3>3'>


In [15]:
pattern = "(>3)+"
str1 = ">3>3<3" # should not match: <3 is not a repetition of the group >3
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 None
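
Parentheses also limit how far the ```|``` operator reaches. The following sketch compares an ungrouped and a grouped alternation; the two patterns and the word list are chosen only for illustration.

```python
import re

# Without parentheses, | splits the whole pattern: "gra" OR "ey".
ungrouped = r"gra|ey"
# With parentheses, only the grouped part alternates: "gr" + ("a" or "e") + "y".
grouped = r"gr(a|e)y"

results = {}
for word in ["gray", "grey", "gra", "ey"]:
    results[word] = (bool(re.fullmatch(ungrouped, word)),
                     bool(re.fullmatch(grouped, word)))
    print(word, results[word])
```

Each tuple holds the result for the ungrouped pattern first and the grouped pattern second.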


## 1.5. Anchors
Anchors are **unique** in that they **don't match characters** - instead, they **match positions** in a string where an expression could land

| Anchors | Explanation |
| --- | --- |
| ```^``` | matches the **beginning** of a string |
| ```$``` | matches the **end** of a string|
| ```\b``` | matches a **word boundary**: the position between a word character and a non-word character (whitespace, punctuation) or the start/end of the string |

### 1.5.1. Anchor Examples
#### Example.1. ```^``` match the beginning

In [16]:
pattern = "^aw*"
str1 = " aww" # should not match: the string starts with a space, so a is not at the beginning
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 None


In [17]:
pattern = "^aw?"
str1 = "aw" # should match: a is at the start of the string
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 <re.Match object; span=(0, 2), match='aw'>


#### Example.2. ```$``` match the end

In [18]:
pattern = r"\w+y$"
str1 = "away" # should match, y is immediately followed by the end of string
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 <re.Match object; span=(0, 4), match='away'>


In [19]:
pattern = r"\w+y$"
str1 = "away " # should not match, y is not immediately followed by the end of string
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 None


#### Example.3. ```\b``` match the boundary (remember to use a raw string)

In [20]:
pattern = r"\w+e\b"
str1 = "bridge " # should not match with fullmatch: \b consumes no characters, so the trailing space is left unmatched
match1 = re.fullmatch(pattern, str1)
print(f"match str1 {match1}")

match str1 None
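
The ```None``` above comes from ```re.fullmatch```, not from ```\b``` itself. As a minimal sketch, the same pattern succeeds with ```re.search```, and ```\b``` can be used to pick out whole words. The second sample string is an assumption made for illustration.

```python
import re

# \b is zero-width: it asserts a position and consumes no characters,
# so re.search can stop at the boundary even though a space follows.
found = re.search(r"\w+e\b", "bridge ")
print(found)

# Whole-word matching: the "cat" inside "concat" has no word boundary
# on its left side, so only the standalone words are returned.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
print(whole_words)
```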


# 2. Using the Python re module
## 2.1. Basic usage

**```re``` methods**
* 
```python
    re.search(<pattern>, <string>)
```
    * Returns a Match object representing the first occurrence of ```<pattern>``` in ```<string>```
    
* 
```python
    re.fullmatch(<pattern>, <string>)
```
    * Returns a Match object only if ```<pattern>``` matches the **entire** ```<string>```
    
* 
```python
    re.match(<pattern>, <string>)
```
    * Returns a Match object if ```<string>``` **starts with** a substring that matches ```<pattern>```
    
**Attributes and methods of the ```re.Match``` object**
* ```.span()``` returns a **tuple** containing the **start and end positions** of the match
* ```.string``` returns the **string** passed into the function
* ```.group()``` returns the **part of the string** where there was a **match**

In [21]:
x = "This string contains 35 characters."
match = re.search(r'\d+', x) # Trying to find number with any digits
print(match)

<re.Match object; span=(21, 23), match='35'>


In [22]:
print(x[match.span()[0]: match.span()[1]]) # Print that number using match.span()

35


In [23]:
match.group(0)

'35'

In [24]:
match2 = re.search(r'\d{3,}', x) # Trying to find a number with 3 or more consecutive digits (no space inside the braces)
print(match2)

None


## 2.2. Using capturing groups
When we use parentheses to group subexpressions, they define **capture groups** that we can then access individually

In [25]:
x = "There were 12 pence in a shilling and 20 shillings in a pound."
mat = re.search(r'(\d+) [a-z\s]+ (\d+)', x) # Find two numbers that are separated by lowercase letters and whitespace

In [26]:
print(mat)

<re.Match object; span=(11, 40), match='12 pence in a shilling and 20'>


In [27]:
print(mat.group())

12 pence in a shilling and 20


In [28]:
mat.group(0)

'12 pence in a shilling and 20'

In [29]:
mat.group(1)

'12'

In [30]:
mat.group(2)

'20'

In [31]:
mat.groups()

('12', '20')
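
Capture groups can also be given names with the ```(?P<name>...)``` syntax, which can be easier to read than numeric indices. This is a sketch of the same search with named groups; the names ```pence``` and ```shillings``` are chosen here for illustration.

```python
import re

x = "There were 12 pence in a shilling and 20 shillings in a pound."

# (?P<name>...) gives a capture group a name in addition to its number.
mat = re.search(r"(?P<pence>\d+) [a-z\s]+ (?P<shillings>\d+)", x)

print(mat.group("pence"))       # same text as mat.group(1)
print(mat.group("shillings"))   # same text as mat.group(2)
print(mat.groupdict())          # all named groups as a dictionary
```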

##  2.3. Other re functions
These are functions from the **re** module that don't return a Match object

* 
```python
    re.findall(<pattern>, <string>)
```
    * Returns a **list** of **all non-overlapping substrings** of ```<string>``` that match ```<pattern>```, read from left to right
    
* 
```python
    re.sub(<pattern>, <repl>, <string>)
```

* Returns a copy of ```<string>``` with all instances of ```<pattern>``` replaced with ```<repl>```

In [32]:
all_mat = re.findall(r'\d+', x)

In [33]:
all_mat = re.sub(r'\d+', str(66),  x)

**Note that the replacement is not done in place: ```re.sub``` returns a new string and leaves ```x``` unchanged**

In [34]:
print(all_mat)

There were 66 pence in a shilling and 66 shillings in a pound.


In [35]:
print(x)

There were 12 pence in a shilling and 20 shillings in a pound.
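
For completeness, here is a short sketch of what ```re.findall``` returns for the same sentence, plus the optional ```count``` argument of ```re.sub```. The variable names are illustrative.

```python
import re

x = "There were 12 pence in a shilling and 20 shillings in a pound."

numbers = re.findall(r"\d+", x)   # list of matched substrings
print(numbers)

# With capture groups, findall returns the groups instead of the whole match.
pairs = re.findall(r"(\d+) (\w+)", x)
print(pairs)

# count limits how many replacements re.sub performs.
first_only = re.sub(r"\d+", "66", x, count=1)
print(first_only)
```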


# 3. Real-world examples
## 3.1. Trimming the trailing space of every line

We want to trim the trailing spaces in a text file that looks like:
```
this lin,,,...e with spaces            
This line wi     th spaces      
this lingfdgde with spac     es            
This line with  spaces  this line with spaces            
This line with spaces        
this line with s326paces               
This li   ne w,,,ith spaces  this line with spaces            
This line w   ith spaces  
```

In [37]:
import re

# Read every line, strip trailing whitespace (including the newline),
# then write the lines back with a single newline after each one.
with open('example.txt', 'r') as filein:
    lines = filein.readlines()
lines = [re.sub(r'\s+$', '', l) for l in lines]
with open('fixed.txt', 'w') as out_f:
    out_f.writelines([l + '\n' for l in lines])
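
An alternative sketch, assuming the whole text fits in memory as one string, is to trim every line in a single ```re.sub``` call with the ```re.MULTILINE``` flag. The sample text below is made up for illustration.

```python
import re

text = "first line   \nsecond line\t \nthird line\n"

# [ \t]+ matches spaces and tabs only, so the newlines are kept.
# re.MULTILINE makes $ match at the end of every line, not just the string.
cleaned = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
print(repr(cleaned))
```

Using ```[ \t]``` instead of ```\s``` is a deliberate choice here, since ```\s``` would also match the newline characters themselves.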

## 3.2. Batch Processing Files

In my most recent research work, I needed to parse many GPU FI results for plotting and analysis. The FI result statistics files use the following name format:
```
vectorAdd1024_1_1.csv
vectorAdd6188_1024_4.csv
vectorAdd16566_256_16.csv
```
This means:
```<Benchmarkname><InputSize>_<BlockSize>_<GridSize>.csv```
Let's parse the file names into a dictionary

In [80]:
import glob
import re
import sys, os

In [81]:
result_files = [os.path.basename(x) for x in glob.glob("./CUDA_vectorAdd_traces/*.csv")]

In [82]:
print(result_files)

['vectorAdd102488_1024_16.csv', 'vectorAdd1024_1_1.csv', 'vectorSub6188_256_16.csv']


In [88]:
res = {'benchmarkname': [],
       'problemsize': [],
       'blockSize': [],
       'gridSize': []
      }
for fname in result_files:
    # The file names are well formatted: a name made of letters followed by 3 numbers.
    # [A-Za-z]+ is used for the name because \w+ would also swallow the digits of the input size.
    mat = re.match(r"([A-Za-z]+)(\d+)_(\d+)_(\d+)\.csv", fname)
    assert mat is not None, "Must match file format"
    benchmarkname, problemsize, blockSize, gridSize = mat.groups()
    res['benchmarkname'].append(benchmarkname)
    res['problemsize'].append(problemsize)
    res['blockSize'].append(blockSize)
    res['gridSize'].append(gridSize)

In [89]:
print(res)

{'benchmarkname': ['vectorAdd', 'vectorAdd', 'vectorSub'], 'problemsize': ['102488', '1024', '6188'], 'blockSize': ['1024', '1', '256'], 'gridSize': ['16', '1', '16']}
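
A pitfall to watch for with file names like these is that ```\w``` also matches digits and ```+``` is greedy. The sketch below compares a ```\w+``` name group with a letters-only group on one of the file names; the letters-only version assumes that benchmark names never contain digits.

```python
import re

fname = "vectorAdd102488_1024_16.csv"

# \w also matches digits, and + is greedy, so (\w+) takes as many
# characters as it can and leaves a single digit for the next group.
greedy = re.match(r"(\w+)(\d+)_(\d+)_(\d+)\.csv", fname)
print(greedy.groups())

# Restricting the name to letters keeps all the digits for the size field.
letters = re.match(r"([A-Za-z]+)(\d+)_(\d+)_(\d+)\.csv", fname)
print(letters.groups())
```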
