# REGEX

In [1]:
from re import search

> `search()` scans the search string from left to right, and as soon as it locates a match for \<regex\>, it stops scanning and returns the match.

In [2]:
search('123', 'foo123bar')

<re.Match object; span=(3, 6), match='123'>

If there is no match, it returns **None**.

The **Match** object carries a good deal of information, such as the matched text and its position in the search string.

The **Match** object can be used as a **boolean**.
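The three points above can be sketched in one cell. The variable names `m` and `missing` are illustrative and not part of the original notebook.

```python
from re import search

m = search('123', 'foo123bar')

# A Match object exposes what text matched and where it was found.
print(m.group())   # the matched text
print(m.start())   # index where the match starts
print(m.end())     # index just past the end of the match
print(m.span())    # (start, end) as a tuple

# With no match, search() returns None.
missing = search('999', 'foo123bar')
print(missing)

# A Match object is truthy and None is falsy, so the result works in an if.
if m:
    print('found')
if not missing:
    print('not found')
```

Testing the result before calling `.group()` avoids an `AttributeError` when the pattern does not match.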

## Python Regex Metacharacters

Character(s) | Meaning
------------ | -------
.   |   Matches any single character except newline
^   |   ∙ Anchors a match at the start of a string <br>∙ Complements a character class
$   |   Anchors a match at the end of a string
\*  |   Matches zero or more repetitions
\+  |   Matches one or more repetitions
?   |   ∙ Matches zero or one repetition <br>∙ Specifies the non-greedy versions of *, +, and ? <br>∙ Introduces a lookahead or lookbehind assertion <br>∙ Creates a named group
{}  |   Matches an explicitly specified number of repetitions
\   |   ∙ Strips a metacharacter of its special meaning <br>∙ Introduces a special character class <br>∙ Introduces a grouping backreference
[]  |   Specifies a character class
\|  |   Designates alternation
()  |   Creates a group
:<br>#<br>=<br>!   |   Designate a specialized group
<>  |   Creates a named group

It’s good practice to use a **raw string** to specify a regex in Python whenever it contains backslashes.

## Metacharacters That Match a Single Character

### `[]`

Matches a single character.

Specifies a **character class**: a specific set of characters to match.

To include a literal `-`, place it first or last in the class, or escape it with a backslash (`\-`).

To include a literal `]`, place it first in the class, or escape it with a backslash (`\]`).

Other regex metacharacters lose their special meaning inside a character class.

In [3]:
search('[0-9][0-9][0-9]', 'foo123bar')

<re.Match object; span=(3, 6), match='123'>

In [4]:
search('[a-z][a-z][a-z]', '123foo456')

<re.Match object; span=(3, 6), match='foo'>

In [5]:
search('[0-9a-fA-F]', '--- a0A2F9 ---')
# [0-9a-fA-F] matches any hexadecimal digit character
# search() returns only the first match

<re.Match object; span=(4, 5), match='a'>

In [6]:
search('ba[artz]', 'foobarqux')
# [artz] matches any single 'a', 'r', 't', or 'z' character.
# 'baa', 'bar', 'bat', 'baz'

<re.Match object; span=(3, 6), match='bar'>
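The rules for literal `-` and `]` and for metacharacters inside a class can be checked with a short sketch. The sample strings and variable names are made up for illustration.

```python
from re import search

# A hyphen placed last (or first) in the class is a literal hyphen.
hyphen_last = search('[ab-]', '123-456')
# A backslash-escaped hyphen is literal at any position.
hyphen_escaped = search(r'[a\-b]', '123-456')

# A closing bracket placed first in the class is a literal bracket.
bracket_first = search('[]ab]', 'foo[1]')
# Escaping it works at any position.
bracket_escaped = search(r'[ab\]]', 'foo[1]')

# Inside a class, '.' and '*' are ordinary characters.
literal_meta = search('[.*]', 'abc*def')

for m in (hyphen_last, hyphen_escaped, bracket_first, bracket_escaped, literal_meta):
    print(m)
```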

### `^`

Complement a character class by specifying it as the first character.

It matches any character that isn’t in the set.

If `^` isn't the first character in the class, it has no special meaning and matches a literal `^`.

In [7]:
search('[^0-9]', '12345foo')

<re.Match object; span=(5, 6), match='f'>
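A small sketch contrasting the two positions of `^` inside a class. The second search string is an invented example.

```python
from re import search

# '^' first: the class is complemented, so the first non-digit matches.
negated = search('[^0-9]', '12345foo')
# '^' not first: it is an ordinary character, so a literal caret matches.
literal_caret = search('[0-9^]', 'foo^12')

print(negated)
print(literal_caret)
```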

### `.`

A wildcard. Matches any character except a newline.

In [8]:
search('1.3', 'foo123bar')
# '1' and '3' match literally, and the . matches the '2'.
# Does the string contain a '1', then any character (except a newline), then a '3'?

<re.Match object; span=(3, 6), match='123'>

In [9]:
print(search('1.3','foo13bar'))

None


### `\d` or `\D`

`\d` matches any decimal digit character. For ASCII text it is shorthand for `[0-9]`.

`\D` is the opposite: it matches any character that is not a decimal digit. For ASCII text it is shorthand for `[^0-9]`.

In [10]:
search(r'\d', 'abc4def')

<re.Match object; span=(3, 4), match='4'>

### `\w` or `\W`

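`\w` matches any alphanumeric word character or the underscore `_`. For ASCII text it is shorthand for `[a-zA-Z0-9_]`.

`\W` is the opposite: it matches any character that is not a word character. For ASCII text it is shorthand for `[^a-zA-Z0-9_]`.

Every character matched by `\d` is also matched by `\w`.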

In [11]:
print(search(r'\w','#()._$@&'))

<re.Match object; span=(4, 5), match='_'>


### `\s` or `\S`

`\s` matches any whitespace character, including spaces, tabs, and newlines.

`\S` is the opposite: it matches any character that is not whitespace.

In [12]:
search(r'\s', 'foo\nbar')

<re.Match object; span=(3, 4), match='\n'>
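The cells above exercise only the lowercase classes. A quick sketch of the uppercase complements `\D`, `\W`, and `\S`, using invented sample strings:

```python
from re import search

not_digit = search(r'\D', '123a456')      # first character that is not a digit
not_word = search(r'\W', 'foo_bar baz')   # first character that is not a word character
not_space = search(r'\S', '  \t foo')     # first character that is not whitespace

print(not_digit)
print(not_word)
print(not_space)
```

Note that the underscore in `foo_bar` counts as a word character, so `\W` skips it and stops at the space.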

### `\`

Strips a metacharacter of its special meaning, so that it matches literally.

Escaping the `\` itself is awkward, because Python string literals also treat the backslash as an escape character. The best way to handle this is to specify the \<regex\> using a **raw string**, which is good practice whenever a regex contains backslashes.

In [13]:
search(r'\.', 'foo.bar')

<re.Match object; span=(3, 4), match='.'>

In [14]:
search(r'\\', r'foo\bar')

<re.Match object; span=(3, 4), match='\\'>
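A minimal sketch of why the raw string helps: the same two-character regex written as a regular string literal and as a raw string literal. The variable names are illustrative.

```python
from re import search

# In a regular string literal every backslash meant for the regex must be doubled,
# so matching one literal backslash takes four backslashes in the source.
regular = search('\\\\', 'foo\\bar')
# In a raw string literal, backslashes reach the regex engine unchanged.
raw = search(r'\\', r'foo\bar')

print(regular)
print(raw)
print('\\\\' == r'\\')  # both literals spell the same two-character pattern
```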

## Anchors

Anchors are zero-width matches: they consume no characters and instead dictate a particular location in the search string where a match must occur.

### `^` or `\A`

The match must occur at the beginning of the search string.

`^` and `\A` behave slightly differently from each other in MULTILINE mode.

In [15]:
search('^foo', 'foobar')

<re.Match object; span=(0, 3), match='foo'>

In [16]:
print(search('^foo', 'barfoo'))

None


### `$` or `\Z`

The match must occur at the end. Whatever precedes `$` or `\Z` must constitute the end of the search string.

As a special case, `$` (but not `\Z`) also matches just before a single newline at the end of the search string.

`$` and `\Z` behave slightly differently from each other in MULTILINE mode.

In [17]:
search('bar$', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [18]:
print(search('bar$', 'barfoo'))

None


In [19]:
search('bar$', 'foobar\n')

<re.Match object; span=(3, 6), match='bar'>
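The MULTILINE difference mentioned for both `^`/`\A` and `$`/`\Z` can be sketched as follows. The two-line string `text` is an invented example, and `MULTILINE` is imported from `re` alongside `search`.

```python
from re import search, MULTILINE

text = 'foo\nbar'

# By default '^' anchors only at the start of the whole string.
caret_default = search(r'^bar', text)
# In MULTILINE mode '^' also matches at the start of each line...
caret_multi = search(r'^bar', text, MULTILINE)
# ...while '\A' still anchors only at the start of the whole string.
a_multi = search(r'\Abar', text, MULTILINE)

# Likewise '$' also matches at the end of each line in MULTILINE mode; '\Z' does not.
dollar_multi = search(r'foo$', text, MULTILINE)
z_multi = search(r'foo\Z', text, MULTILINE)

for m in (caret_default, caret_multi, a_multi, dollar_multi, z_multi):
    print(m)
```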

### `\b` or `\B`

`\b` anchors a match to a word boundary: the match must occur at the beginning or end of a **word**.

A word consists of a sequence of alphanumeric characters or underscores (`[a-zA-Z0-9_]`), the same as for the `\w` character class.

`\B` is the opposite: it matches only at positions that are not word boundaries.

In [20]:
search(r'\bdad', 'persona dadivosa')

<re.Match object; span=(8, 11), match='dad'>

In [21]:
search(r'\bdad', 'persona.dadivosa')

<re.Match object; span=(8, 11), match='dad'>

In [22]:
search(r'dad\b', 'mentalidad positiva')

<re.Match object; span=(7, 10), match='dad'>
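The cells above only exercise `\b`. A short sketch of `\B`, reusing the same search strings:

```python
from re import search

# 'dad' preceded by a word character: no boundary before it, so \B succeeds.
inside_word = search(r'\Bdad', 'mentalidad positiva')
# 'dad' at the start of a word: a boundary is present, so \B fails.
at_word_start = search(r'\Bdad', 'persona dadivosa')

print(inside_word)
print(at_word_start)
```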

## Quantifiers

A quantifier metacharacter immediately follows a portion of a \<regex\> and indicates how many times that portion must occur for the match to succeed.

### `*`

Matches **zero or more** repetitions of the preceding regex.

In [23]:
print(search('ep*ti', 'setimo'))
print(search('ep*ti', 'septimo'))
print(search('ep*ti', 'seppppptimo'))

<re.Match object; span=(1, 4), match='eti'>
<re.Match object; span=(1, 5), match='epti'>
<re.Match object; span=(1, 9), match='epppppti'>


In [24]:
search('foo.*bar', '# foo $qux@grault % bar #')

<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>

### `+`

Matches **one or more** repetitions of the preceding regex.

In [25]:
print(search('ep+ti', 'setimo'))
print(search('ep+ti', 'septimo'))
print(search('ep+ti', 'seppppptimo'))

None
<re.Match object; span=(1, 5), match='epti'>
<re.Match object; span=(1, 9), match='epppppti'>


### `?`

Matches **zero or one** repetition of the preceding regex.

In [26]:
print(search('ep?ti', 'setimo'))
print(search('ep?ti', 'septimo'))
print(search('ep?ti', 'seppppptimo'))

<re.Match object; span=(1, 4), match='eti'>
<re.Match object; span=(1, 5), match='epti'>
None


### `*?` or `+?` or `??`

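The non-greedy (or lazy) versions of the `*`, `+`, and `?` quantifiers. A greedy quantifier matches as many repetitions as possible, while its lazy version matches as few as possible.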

In [27]:
print(search('<.*>', '%<foo> <bar> <baz>%'))
print(search('<.*?>', '%<foo> <bar> <baz>%'))

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>
<re.Match object; span=(1, 6), match='<foo>'>


In [28]:
print(search('ep?', 'seppppptimo')) # Between zero and one repetition, it picks one
print(search('ep??', 'seppppptimo')) # Between zero and one repetition, it picks zero

<re.Match object; span=(1, 3), match='ep'>
<re.Match object; span=(1, 2), match='e'>


### `{m}`

Matches exactly m repetitions of the preceding regex.

In [29]:
print(search('x-{3}x', 'x--x'))
print(search('x-{3}x', 'x---x'))
print(search('x-{3}x', 'x----x'))

None
<re.Match object; span=(0, 5), match='x---x'>
None


### `{m,n}`

Matches any number of repetitions of the preceding regex from **m** to **n**, inclusive.

Omitting **m** implies a lower bound of 0, and omitting **n** implies an unlimited upper bound.

Empty braces, `{}`, have no quantifier meaning and match just the literal string `{}`.

In [30]:
print(search('x-{3,4}x', 'x--x'))
print(search('x-{3,4}x', 'x---x'))
print(search('x-{3,4}x', 'x----x'))
print(search('x-{3,4}x', 'x-----x'))

None
<re.Match object; span=(0, 5), match='x---x'>
<re.Match object; span=(0, 6), match='x----x'>
None
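The omitted-bound forms and the empty-brace case described above can be sketched like this. The patterns and strings are illustrative additions.

```python
from re import search

at_most_two = search('x-{,2}x', 'x--x')          # omitted m: 0 to 2 hyphens
zero_hyphens = search('x-{,2}x', 'xx')           # zero repetitions also qualify
too_many = search('x-{,2}x', 'x---x')            # three hyphens exceed the upper bound
at_least_three = search('x-{3,}x', 'x------x')   # omitted n: 3 or more hyphens
literal_braces = search('x{}y', 'x{}y')          # empty braces match literally

for m in (at_most_two, zero_hyphens, too_many, at_least_three, literal_braces):
    print(m)
```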


### `{m,n}?`

The non-greedy (lazy) version of `{m,n}`.

`{m,n}` will match as many characters as possible, and `{m,n}?` will match as few as possible:

In [31]:
print(search('a{3,5}', 'aaaaaaaa'))
print(search('a{3,5}?', 'aaaaaaaa'))

<re.Match object; span=(0, 5), match='aaaaa'>
<re.Match object; span=(0, 3), match='aaa'>


## Groups

Grouping constructs break up a regex in Python into subexpressions or groups. This serves two purposes:

* Grouping

* Capturing

The groups can be nested.
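A sketch of nested groups, assuming a made-up file name. Groups are numbered by the position of their opening parenthesis, so the outer group comes first.

```python
from re import search

# Group 1 wraps the whole name; groups 2 and 3 are nested inside it.
m = search(r'^((\w+)\.(\w+))$', 'transcript.pdf')

print(m.group(1))   # outer group: the full file name
print(m.group(2))   # first inner group: the base name
print(m.group(3))   # second inner group: the extension
print(m.groups())
```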

### `(<regex>)`

**GROUPING**: To group an entire expression, so that a quantifier applies to it as a unit.

In [33]:
search('(bar)+', 'foo barbarbarbar baz')

<re.Match object; span=(4, 16), match='barbarbarbar'>

### `(<regex>)`

**CAPTURING**: To extract part of the match.

In [39]:
m = search(r'(^\w+)\.pdf$', 'file_record_transcript.pdf')
m.group(1)
# Extract only the name, without the extension.

'file_record_transcript'

In [44]:
m = search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
print(m.groups())
print(m.group()) # m.group(0)
print(m.group(2))
print(m.group(1,2))

('foo', 'quux', 'baz')
foo,quux,baz
quux
('foo', 'quux')


### `(?P<name><regex>)`

Creates a named **capturing** group.

In [45]:
m = search('(?P<titulo>[^/]+)$', ' https://www.eluniversal.com.co/regional/bolivar/fuimos-la-mejor-unidad-de-la-region-en-todo-el-ano-policia-de-bolivar-NF6070078')
m.group('titulo')

'fuimos-la-mejor-unidad-de-la-region-en-todo-el-ano-policia-de-bolivar-NF6070078'
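A named group can still be read by number, and `groupdict()` returns all named groups at once. A minimal sketch with an invented date string and the hypothetical group names `year` and `month`:

```python
from re import search

m = search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'released 2021-07')

print(m.group('year'))    # access by name
print(m.group(1))         # the same group, accessed by number
print(m.groupdict())      # all named groups as a dictionary
```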

### `(...|...)`

Alternation (logical OR): matches any one of the alternatives separated by `|`.

In [34]:
print(search('(cats|dogs)', 'I love cats'))
print(search('(cats|dogs)', 'I love dogs'))


<re.Match object; span=(7, 11), match='cats'>
<re.Match object; span=(7, 11), match='dogs'>
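The parentheses matter because `|` otherwise splits the entire pattern. A short sketch with an invented search string:

```python
from re import search

# Without parentheses the alternatives are 'I love cats' and 'dogs'.
unscoped = search('I love cats|dogs', 'I hate dogs')
# Parentheses limit the alternation to the two words after 'I love '.
scoped = search('I love (cats|dogs)', 'I hate dogs')

print(unscoped)
print(scoped)
```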


## Flags

`search(<regex>, <string>, <flags>)`

Scans a string for a regex match, applying the specified modifier \<flags\>.

Flags modify how the match is performed. For example, a flag can make matching of alphabetic characters case-insensitive, cause the dot metacharacter to match a newline, or specify Unicode encoding for character classification, among other uses.
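The behaviors listed above correspond to flags in the `re` module. The sketch below uses `IGNORECASE`, `DOTALL`, `ASCII`, and `MULTILINE`; the choice of flags and the sample strings are illustrative, not taken from the original notebook.

```python
from re import search, IGNORECASE, DOTALL, ASCII, MULTILINE

# IGNORECASE: alphabetic matching ignores case.
ignore_case = search('bar', 'FOOBAR', IGNORECASE)

# DOTALL: the dot also matches a newline.
dot_default = search('foo.bar', 'foo\nbar')
dot_all = search('foo.bar', 'foo\nbar', DOTALL)

# ASCII: \w, \d, \s and friends classify ASCII characters only.
unicode_word = search(r'\w', '\u00e9')        # the letter é is a word character by default
ascii_word = search(r'\w', '\u00e9', ASCII)   # but not in ASCII mode

# Flags are combined with the | operator.
combined = search('^BAR', 'foo\nbar', IGNORECASE | MULTILINE)

for m in (ignore_case, dot_default, dot_all, unicode_word, ascii_word, combined):
    print(m)
```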

## Regex in the Console

**grep** or **find**

`cat results.csv | grep ',3[0-9],'`
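For comparison, a rough Python equivalent of that filter. The `rows` list is hypothetical data standing in for the contents of `results.csv`, which the original does not show.

```python
from re import search

# Hypothetical rows standing in for the lines of results.csv.
rows = ['ana,29,bogota', 'luis,34,cali', 'eva,3,lima', 'sol,39,quito']

# Keep the rows that contain a comma-delimited field from 30 to 39.
matching = [row for row in rows if search(',3[0-9],', row)]
print(matching)
```

The row with the field `3` is skipped because `[0-9]` requires a second digit after the `3`.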