## Raw strings

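A raw string is a string literal prefixed with `r`. Inside a raw string, backslashes are kept as literal characters instead of starting escape sequences such as `\n`. This is convenient for regular expressions, which use backslashes heavily.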

For example:

- `a = r"this is a \n raw string"`: `a` is a raw string.
- `b = "this is a \n raw string"`: `b` is not a raw string.



In [6]:
a = r"this is a \n raw string"
b = "this is a \n raw string"
print(type(a))
print(type(b))

<class 'str'>
<class 'str'>

A raw string is also a `str` object: the `r` prefix only changes how the literal is read, not its type.

In [32]:
type(a), type(b)

(str, str)

In [33]:
print(a) # The raw string keeps \n as two literal characters (backslash and n)
print(b) # The standard string interprets \n as a newline

this is a \n raw string
this is a 
 raw string
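
A quick check makes the difference concrete. This is only a minimal sketch; the names `raw` and `cooked` are made up for illustration.

```python
raw = r"a\nb"      # backslash and 'n' are kept as two separate characters
cooked = "a\nb"    # \n is interpreted as a single newline character

print(len(raw))         # 4
print(len(cooked))      # 3
print(raw == "a\\nb")   # True: the raw literal is the same string as the escaped one
```

In other words, a raw string is just another way of writing the literal, which saves doubling every backslash in a regular expression.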


## Compiling Regular expressions

We would like to extract relevant information from strings. For example,
we would like to parse phone numbers or URLs.

In [11]:
text = """
If you want this call me at 689878997.
Mr Albert did not access www.piratetorrents.com
Alba was upsed because eseOese.com was not working.
"""

### findall

#### `pattern=re.compile(regex)` and `pattern.findall`

- Generate a regular expression using 

    - **`p = re.compile(r'our_regular_expression')`**.
    - `p` will be a `re.Pattern` object (displayed as `SRE_Pattern` in older Python versions) that we can apply to any string.


- **`p.findall(s)`** returns a list containing all non-overlapping substrings of `s` that satisfy our regular expression.




In [105]:
## Ex: get all text that starts with "Alb" until the end of the line

# Albert did not access www.piratetorrents.com
# Alba was upsed because eseOese.com was not working.

In [12]:
import re
pattern = re.compile(r'Alb.*')
all_matches = pattern.findall(text)
all_matches

['Albert did not access www.piratetorrents.com',
 'Alba was upsed because eseOese.com was not working.']

### finditer

- **`p.finditer(s)`** returns an iterator. Each element is a `re.Match` object (displayed as `SRE_Match` in older Python versions) whose printed form shows 
    - **`span=(x, y)`** indicating that the match starts at position `x` and ends just before position `y` of `s`, so `s[x:y]` is the matched text.
    - **`match='...'`** indicating the substring that satisfies the regular expression.



In [13]:
import re
pattern = re.compile(r'Al')
matches = pattern.finditer(text)
matches = [x for x in matches]

In [14]:
matches

[<re.Match object; span=(43, 45), match='Al'>,
 <re.Match object; span=(88, 90), match='Al'>]

In [15]:
beg, end = matches[0].span()
text[beg:end]

'Al'
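
Besides `span()`, each match object also exposes the matched text and its boundaries directly. The sketch below assumes a short made-up string (`sample`) rather than the `text` used above.

```python
import re

sample = "Mr Albert met Alba"   # hypothetical sample string

# group() returns the matched text; start() and end() return its boundaries
found = [(m.group(), m.start(), m.end()) for m in re.finditer(r'Alb\w*', sample)]
print(found)   # [('Albert', 3, 9), ('Alba', 14, 18)]
```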

#### Substitution

- **`p.sub(r'XXX', s)`** returns a copy of `s` in which every match of the pattern is replaced by `XXX`. The original string `s` is not modified.
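
A minimal sketch of substitution with a compiled pattern; the input strings and variable names are made up for illustration.

```python
import re

digits = re.compile(r'\d')

# Replace every digit by 'X'
masked = digits.sub('X', 'call me at 689878997')
print(masked)       # call me at XXXXXXXXX

# The optional count argument limits the number of replacements
first_only = digits.sub('X', 'room 42', count=1)
print(first_only)   # room X2
```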





## Regular expressions


### Basic matching

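- **`.`** Matches any character except a newline.


- **`\d`** Matches any digit (0-9). Equivalent to `[0-9]`.


- **`\D`** Matches any non-digit. Equivalent to `[^0-9]`.


- **`\w`** Matches any "word character" (letter, digit or underscore). Equivalent to `[a-zA-Z0-9_]`.


- **`\W`** Matches any non-word character. Equivalent to `[^a-zA-Z0-9_]`.


- **`\s`** Matches any whitespace character (space, tab, newline). Equivalent to `[ \t\n\r\f\v]`.


- **`\S`** Matches any non-whitespace character. Equivalent to `[^ \t\n\r\f\v]`.


Note that for `str` patterns Python matches Unicode by default, so `\d`, `\w` and `\s` also accept non-ASCII digits, letters and whitespace. The bracket equivalences above are exact only for ASCII text or with the `re.ASCII` flag.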
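
The character shortcuts can be tried on a small string. This is only a sketch; `sample` is a made-up string chosen to contain digits, letters, an underscore, spaces and punctuation.

```python
import re

sample = "Room 42, floor_3!"   # hypothetical sample string

digits     = re.findall(r'\d', sample)
words      = re.findall(r'\w+', sample)
non_words  = re.findall(r'\W', sample)
whitespace = re.findall(r'\s', sample)

print(digits)       # ['4', '2', '3']
print(words)        # ['Room', '42', 'floor_3']
print(non_words)    # [' ', ',', ' ', '!']
print(whitespace)   # [' ', ' ']
```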


In [34]:
### Exercise: Find all words starting with "Alb"
text = """
If you want this call me at 689878997.
Mr Albert did not access www.piratetorrents.com
Alba was upsed because eseOese.com was not working.
"""

In [90]:
pattern     = re.compile(r'Alb\w*')
all_matches = pattern.findall(text)
all_matches

['Albert', 'Alba']

Exercise: Find all words that start with "a" or "A"

In [98]:
print(text)


If you want this call me at 689878997.
Mr Albert did not access www.piratetorrents.com
Alba was upsed because eseOese.com was not working.



In [131]:
pattern     = re.compile(r'a.*')
all_matches = pattern.findall(text)
all_matches

['ant this call me at 689878997.',
 'access www.piratetorrents.com',
 'a was upsed because eseOese.com was not working.']

In [137]:
### Exercise find all words that start with "a" or "A" 
pattern     = re.compile(r'\s[Aa]+.*?\s')
all_matches = pattern.findall(text)
all_matches

[' at ', ' Albert ', ' access ', '\nAlba ']
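
The pattern above returns the surrounding whitespace together with each word, and it could miss a word that directly follows another match, because the space between them is already consumed. One possible alternative, sketched here, relies on the word boundary `\b` (described later with the anchors) so that only the words themselves are returned.

```python
import re

text = """
If you want this call me at 689878997.
Mr Albert did not access www.piratetorrents.com
Alba was upsed because eseOese.com was not working.
"""

# \b marks the start of a word without consuming any character
pattern = re.compile(r'\b[Aa]\w*')
words = pattern.findall(text)
print(words)   # ['at', 'Albert', 'access', 'Alba']
```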

### Quantifiers


Quantifiers are operators that are applied to the preceding symbol (a character, a character class or a group).


- **`*`** the previous symbol appears 0 or more times.


- **`+`** the previous symbol appears 1 or more times.


- **`?`** the previous symbol appears at most once (0 or 1 times).


- **`{k}`** the previous symbol appears exactly k times.


- **`{min,max}`** the previous symbol appears between min and max times (both inclusive). For example, `{2,8}` matches between 2 and 8 repetitions.


For example:

- `r'a.*'`: the `*` applies to `.`, so this expression matches an `a` followed by any number of characters (possibly none) up to the end of the line.


- `r'\d{4}'`: the `{4}` applies to `\d`, so this expression matches 4 consecutive digits.
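
To compare the quantifiers side by side, the sketch below applies each one to the letter `a` and checks which of a few made-up strings (`samples`) match completely. `re.fullmatch` is used so that the whole string has to satisfy the pattern.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]   # hypothetical test strings
patterns = [r'ca*t', r'ca+t', r'ca?t', r'ca{2}t', r'ca{2,3}t']

# For each pattern, keep the samples that match from start to end
matched = {q: [s for s in samples if re.fullmatch(q, s)] for q in patterns}
for q, hits in matched.items():
    print(q, hits)

# ca*t     ['ct', 'cat', 'caat', 'caaat']
# ca+t     ['cat', 'caat', 'caaat']
# ca?t     ['ct', 'cat']
# ca{2}t   ['caat']
# ca{2,3}t ['caat', 'caaat']
```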


### Character classes `[...]`

Character classes `[...]` match any character in the brackets. For example, `[aeiou]` matches any vowel.


In [19]:
# find all consecutive pairs of vowels in the text
text = "The man that went to the moon was not following any hierarchy."
pattern     = re.compile(r'[aeiou]{2}')
all_matches = pattern.findall(text)
all_matches

['oo', 'ie']

In [169]:
# find all words that contain consecutive pairs of vowels in the text
pattern     = re.compile(r'\w+[aeiou]{2}\w+')
all_matches = pattern.findall(text)
all_matches

['moon', 'hierarchy']

In [21]:
### Exercise: Find all words starting with "Alb" and the next word; a comma after 
### the first word should not prevent the match
text = """
If you want this call me at 689878997.
Albert, do not say more times you can't do it.
Alba was upsed because eseOese.com was not working.
"""

In [22]:
pattern     = re.compile(r'Alb[\w,:]*\s\w*')
all_matches = pattern.findall(text)
all_matches

['Albert, do', 'Alba was']

In [23]:
### Exercise: Find all words starting with "Alb" and the next two words; a comma after 
### the first word should not prevent the match

pattern     = re.compile(r'Alb[\w,:]*\s\w*\s\w*')
all_matches = pattern.findall(text)
all_matches

['Albert, do not', 'Alba was upsed']

In [24]:
### Exercise: Find all words starting with "Alb" and the next two words; commas after 
### the first two words should not prevent the match
text = """
If you want this call me at 689878997.
Albert, Paco, come here and do not say more times you can't do it.
Alba was upsed because eseOese.com was not working.
"""

In [25]:
pattern     = re.compile(r'Alb[\w,:]*\s[\w,:]*\s\w*')
all_matches = pattern.findall(text)
all_matches

['Albert, Paco, come', 'Alba was upsed']

In [30]:
### Exercise: Find all phone numbers (9 consecutive digits)
text = """
If you want this call me at 689878997, do not call 911.
Albert, Paco, come here and do not say more times you can't do it.
Alba was upsed because eseOese.com was not working.
"""

In [31]:
pattern     = re.compile(r'\d')
all_matches = pattern.findall(text)
# all_matches

#### Repetitions: `{p,q}`

Repetitions of a regular expression can be expressed with `{}`.

- `r'a{2,3}'`: the character `a` appears 2 or 3 times in a row.

In [252]:
pattern     = re.compile(r'\d{9}')
all_matches = pattern.findall(text)
# all_matches

In [253]:
pattern     = re.compile(r'\d\d\d\d\d\d\d\d\d')
all_matches = pattern.findall(text)
#all_matches
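
Both patterns find any run of 9 digits, including the first 9 digits of a longer number. If that is not wanted, one option, sketched below, is to surround the pattern with word boundaries `\b`. The string `sample` is made up and includes an 11-digit number to show the difference.

```python
import re

sample = "call me at 689878997, do not call 911, ticket 12345678901"   # hypothetical

loose  = re.findall(r'\d{9}', sample)
strict = re.findall(r'\b\d{9}\b', sample)

print(loose)    # ['689878997', '123456789']
print(strict)   # ['689878997']
```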


##### Equivalent versions

```
Explicit Long         Short version
(some_expr){0,1}      (some_expr)?
(some_expr){1}        (some_expr)
(some_expr){0,}       (some_expr)*
(some_expr){1,}       (some_expr)+
```
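
The equivalences in the table can be checked empirically. The sketch below uses `ab` in place of `some_expr` and a few made-up strings; the groups are written as non-capturing `(?:...)`, which is a choice of this example and not required by the table.

```python
import re

samples = ["", "ab", "abab", "ababab"]   # hypothetical test strings
pairs = [
    (r'(?:ab){0,1}', r'(?:ab)?'),
    (r'(?:ab){1}',   r'(?:ab)'),
    (r'(?:ab){0,}',  r'(?:ab)*'),
    (r'(?:ab){1,}',  r'(?:ab)+'),
]

# Each long form must accept exactly the same samples as its short form
agree = all(
    bool(re.fullmatch(long_form, s)) == bool(re.fullmatch(short_form, s))
    for long_form, short_form in pairs
    for s in samples
)
print(agree)   # True
```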


#### Matching Special characters: `special_chars=[ '[', ']' ,'{' ,  '}' , '?' , '+' , '.' ]`

In order to build regular expressions we use special characters such as  

`special_chars=[ '[', ']' ,'{' ,  '}' , '?' , '+' , '.' ]`

If we want to build a regular expression that matches one of these characters literally, we need to escape it with a backslash `\`. For example, `r'\?'` will match the `?` symbol.
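
The effect of escaping can be seen with the dot, as in the sketch below (the string `sample` is made up). The standard library also provides `re.escape`, which adds the backslashes automatically when a literal string has to be used inside a pattern.

```python
import re

sample = "Price: 3.50? Or 3x50?"   # hypothetical sample string

unescaped = re.findall(r'3.50', sample)    # . matches any character
escaped   = re.findall(r'3\.50', sample)   # \. matches only a literal dot

print(unescaped)            # ['3.50', '3x50']
print(escaped)              # ['3.50']
print(re.escape('3.50?'))   # 3\.50\?
```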



In [292]:
# Ex: find a word followed by `?`
text = """
       Where is maria?
       Who told you this class would be any good ?
       """

In [311]:
pattern     = re.compile(r'\w+[\w\s]\?')
all_matches = pattern.findall(text)
all_matches

['maria?', 'good ?']


#### Anchors

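Anchors are useful for tying a pattern to some edge without consuming any characters themselves.

- **`\b`** Matches a word boundary: the position between a word character (`\w`) and a non-word character, or between a word character and the start or end of the string.


- **`\B`** Matches any position that is not a word boundary.


- **`^`** Matches the beginning of the string (and of every line when the `re.MULTILINE` flag is used).


- **`[^a-e]`** is not an anchor: inside brackets, a leading `^` negates the character set, so this matches any character outside `a-e`.


- **`$`** Matches the end of the string, or the position just before a trailing newline (and the end of every line when the `re.MULTILINE` flag is used).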
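
A short sketch of the anchors in action. The strings are made up for illustration; `sample` has two lines so that the effect of `re.MULTILINE` is visible.

```python
import re

sample = "cat one\ncat two"   # hypothetical two-line string

start_only  = re.findall(r'^cat', sample)                 # start of the string only
every_line  = re.findall(r'^cat', sample, re.MULTILINE)   # start of every line
at_end      = re.findall(r'two$', sample)                 # end of the string
whole_word  = re.findall(r'\bcat\b', "cat catalog locate")
outside_a_e = re.findall(r'[^a-e]', "abcxyz")             # negated set, not an anchor

print(start_only)    # ['cat']
print(every_line)    # ['cat', 'cat']
print(at_end)        # ['two']
print(whole_word)    # ['cat']
print(outside_a_e)   # ['x', 'y', 'z']
```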

In [338]:
# Ex: Find all occurrences of the substring cat in the texts
text1 = "the cat sat in the mat"
text2 = "the cat in the mat did not locate the food."
pattern = re.compile(r'cat')
all_matches_1 = pattern.findall(text1)
all_matches_2 = pattern.findall(text2)
all_matches_1, all_matches_2

(['cat'], ['cat', 'cat'])

In [339]:
# Ex: Find all occurrences of the word cat in the texts
text1 = "the cat sat in the mat"
text2 = "the cat in the mat did not locate the food."
pattern = re.compile(r'\bcat\b')  # \b on both sides: "cat" inside "locate" (or "catalog") is skipped
all_matches_1 = pattern.findall(text1)
all_matches_2 = pattern.findall(text2)
all_matches_1, all_matches_2

(['cat'], ['cat'])

### String substitution

In [15]:
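# `sen` is not defined earlier in this notebook; the value below is inferred from the output
sen = 'The crashed car was found in a crowed neighborood of Barcelona.'
exp = re.compile('car')
sen_changed = re.sub(exp, 'auto', sen)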

In [16]:
sen_changed

'The crashed auto was found in a crowed neighborood of Barcelona.'

## Text tokenization sklearn 

Regular expressions can be used to split strings into tokens.

https://datascience.stackexchange.com/questions/54904/how-to-avoid-tokenizing-w-sklearn-feature-extraction
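
As a rough illustration of the idea, the sketch below tokenizes a sentence using only `re`. The pattern mirrors the default `token_pattern` documented for scikit-learn's `CountVectorizer` (tokens of two or more word characters); the sentence and the lowercasing step are assumptions made for this example, not scikit-learn code.

```python
import re

# Tokens are runs of at least two word characters delimited by word boundaries
token_pattern = re.compile(r'\b\w\w+\b')

sentence = "Alba was upset: I can't open eseOese.com"   # hypothetical sentence
tokens = token_pattern.findall(sentence.lower())
print(tokens)   # ['alba', 'was', 'upset', 'can', 'open', 'eseoese', 'com']
```

Single-character tokens such as `i` and the `t` of `can't` are dropped, which is the behaviour discussed in the linked question.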

## Good Material 

- https://www.regular-expressions.info/anchors.html