In [2]:
import re

- references
    - https://regexr.com/

## basics

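- Anchoring at the start and end
    - Start: `^xxx`
    - End: `xxx$`
    - `^(?!str)`: does not start with `str`
        - `(?!...)` is a negative lookahead
    - `(?<!str)$`: does not end with `str`
        - `(?<!...)` is a negative lookbehind
- `\` is the escape character
    - to match a literal `$`, write `\$`
    - to match a literal `.`, write `\.`
- `\s`: any whitespace character, including newlines
- `\d` and `\w`
    - `\d`: a digit, `0-9`
    - `\w`: a word character, `a-z, A-Z, 0-9, _`
        - does not include `&`
    - for `str` patterns, Python also matches other Unicode digits and letters unless `re.ASCII` is set
- Quantifiers
    - `*`: 0 or more
    - `?`: 0 or 1
    - `+`: 1 or more
    - `{n}`: exactly n occurrences
- Lazy (non-greedy) matching
    - append `?` to a quantifier: `*?`, `+?`, `??`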
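
A minimal sketch of the negative lookarounds and of greedy versus lazy matching from the list above. The sample strings (`words`, `html`) are made up for illustration and do not appear elsewhere in these notes.

```python
import re

# Hypothetical sample strings, chosen only for illustration.
words = ["prefix_data", "data_suffix", "plain"]

# ^(?!prefix): keep strings that do NOT start with "prefix".
not_starting = [w for w in words if re.search(r"^(?!prefix)", w)]

# (?<!suffix)$: keep strings that do NOT end with "suffix".
not_ending = [w for w in words if re.search(r"(?<!suffix)$", w)]

# Greedy `.+` grabs as much as possible; lazy `.+?` stops at the first `>`.
html = "<b>bold</b>"
greedy = re.search(r"<.+>", html).group()
lazy = re.search(r"<.+?>", html).group()

print(not_starting)  # ['data_suffix', 'plain']
print(not_ending)    # ['prefix_data', 'plain']
print(greedy)        # <b>bold</b>
print(lazy)          # <b>
```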


## match

- `re.match` only matches at the beginning of the string
- `re.fullmatch` requires the pattern to match the entire string
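
As a rough side-by-side of the two functions above, together with `re.search` (which scans the whole string), here is a small sketch. The URL and the pattern are hypothetical and only serve the comparison.

```python
import re

# Hypothetical sample string and pattern, for illustration only.
url = "http://example.org/path"
pattern = r"example\.org"

at_start = re.match(pattern, url)     # None: the string does not START with the pattern
anywhere = re.search(pattern, url)    # finds "example.org" in the middle of the string
whole = re.fullmatch(pattern, url)    # None: the pattern does not cover the WHOLE string
whole_ok = re.fullmatch(r"http://example\.org/path", url)

print(at_start, whole)       # None None
print(anywhere.group())      # example.org
print(whole_ok is not None)  # True
```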

In [60]:
urls = ['https://www.socratica.com', 
        'http://www.socratica.org', 
        'http://www.abc.bcd.org', 
        'file://test.this.path', 
        'com.socratica.www.https://']

### re.match

In [51]:
regex = 'https?'
for url in urls:
    if re.match(regex, url):
        print(url)

https://www.socratica.com
http://www.socratica.org


### fullmatch

In [61]:
# raw string; the dots are escaped so they match a literal "." only
regex = r'https?://w{3}\.\w+\.(org|com)'
for url in urls:
    if re.fullmatch(regex, url):
        print(url)

https://www.socratica.com
http://www.socratica.org


## `re.search` and `re.findall`

- `re.findall`: returns a list of all non-overlapping matches

In [2]:
s = "The bottle of water costs $3.24 and that's outrageous... it's like 3x what it should be!"

In [4]:
regex = r'\$\s*(\d+\.\d+)\W*'
re.findall(regex, s)

['3.24']
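
The result above contains `3.24` rather than `$3.24` because `re.findall` returns the captured group when the pattern has one. A short sketch of how the number of groups changes the result; the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text with two prices.
s = "apples $3.24, pears $10.50"

no_group = re.findall(r"\$\d+\.\d+", s)        # no group: whole matches
one_group = re.findall(r"\$(\d+\.\d+)", s)     # one group: that group's text
two_groups = re.findall(r"\$(\d+)\.(\d+)", s)  # several groups: tuples of groups

print(no_group)    # ['$3.24', '$10.50']
print(one_group)   # ['3.24', '10.50']
print(two_groups)  # [('3', '24'), ('10', '50')]
```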

### re.search

In [23]:
names = ['Finn  Bindeballe', 
         'Geir Anders Berge', 
         'HappyCodingRobot', 
         'Ron   Cromberge', 
         'Sohil']

In [24]:
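# exactly two words: a first name and a last name
regex = r'^\w+\s+\w+$'
for name in names:
    res = re.search(regex, name)
    # re.search returns None when nothing matches, so a truthy result means a match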
    if res:
        print(name)

Finn  Bindeballe
Ron   Cromberge


In [32]:
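regex = r'C\w*'
for name in names:
    res = re.search(regex, name)
    if res:
        print(name)
        # start()/end() are the slice indices of the match within `name`
        print(res.start(), res.end(), name[res.start():res.end()])
        # group() and group(0) both return the whole matched text
        print(res.span(), res.group(), res.group(0))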

HappyCodingRobot
5 16 CodingRobot
(5, 16) CodingRobot CodingRobot
Ron   Cromberge
6 15 Cromberge
(6, 15) Cromberge Cromberge


### findall

In [40]:
names = ['Brian Daugette', 
         'Veronica Supersonica', 
         'Tony Gasparovic', 
         'Patrick Germann', 
         'm!sha']

In [41]:
regex = '[a-z]+'
for name in names:
    matches = re.findall(regex, name)
    if matches:
        print(matches)

['rian', 'augette']
['eronica', 'upersonica']
['ony', 'asparovic']
['atrick', 'ermann']
['m', 'sha']


In [46]:
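regex = '[a-z]+'
for name in names:
    # finditer returns an iterator of match objects; an iterator is always
    # truthy, so no `if` check is needed before looping over it
    for match in re.finditer(regex, name):
        print(match)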

<re.Match object; span=(1, 5), match='rian'>
<re.Match object; span=(7, 14), match='augette'>
<re.Match object; span=(1, 8), match='eronica'>
<re.Match object; span=(10, 20), match='upersonica'>
<re.Match object; span=(1, 4), match='ony'>
<re.Match object; span=(6, 15), match='asparovic'>
<re.Match object; span=(1, 7), match='atrick'>
<re.Match object; span=(9, 15), match='ermann'>
<re.Match object; span=(0, 1), match='m'>
<re.Match object; span=(2, 5), match='sha'>


## cases

### re.search

In [12]:
import re

s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
match = re.search(r"\[([A-Za-z0-9_]+)\]", s)
# match.group() == match.group(0): the text matched by the whole pattern
print(match.group(), match.group(0))
# match.group(1): the text captured by the first `(...)` group
print(match.group(1))

[cus_Y4o9qMEZAugtnW] [cus_Y4o9qMEZAugtnW]
cus_Y4o9qMEZAugtnW


### groups

In [33]:
names = ['Brian Daugette', 
         'Veronica Supersonica', 
         'Tony Gasparovic', 
         'Patrick Germann', 
         'm!sha']

In [34]:
regex = r'^\w+\s+\w+$'
for name in names:
    match = re.search(regex, name)
    if match:
        print(name)

Brian Daugette
Veronica Supersonica
Tony Gasparovic
Patrick Germann


In [36]:
regex = r'^(\w+)\s+(\w+)$'
for name in names:
    match = re.search(regex, name)
    if match:
        print(f'first name: {match.group(1)}, last name: {match.group(2)}')

first name: Brian, last name: Daugette
first name: Veronica, last name: Supersonica
first name: Tony, last name: Gasparovic
first name: Patrick, last name: Germann


In [39]:
regex = r'^(?P<fn>\w+)\s+(?P<ln>\w+)$'
for name in names:
    match = re.search(regex, name)
    if match:
        print(f'first name: {match.group("fn")}, last name: {match.group("ln")}')

first name: Brian, last name: Daugette
first name: Veronica, last name: Supersonica
first name: Tony, last name: Gasparovic
first name: Patrick, last name: Germann
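
Beyond `group(name)`, a match object can also return all groups at once. A small sketch using the same named-group pattern and one of the names from the list above:

```python
import re

# Same named-group pattern as in the cell above, compiled once for reuse.
pattern = re.compile(r"^(?P<fn>\w+)\s+(?P<ln>\w+)$")
match = pattern.search("Brian Daugette")

all_groups = match.groups()    # positional view of every group
by_name = match.groupdict()    # dict keyed by group name
last = match["ln"]             # subscript access (Python 3.6+)

print(all_groups)  # ('Brian', 'Daugette')
print(by_name)     # {'fn': 'Brian', 'ln': 'Daugette'}
print(last)        # Daugette
```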


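## Miscellaneous

### Uses of the `?` metacharacter

- `(?:str)`: non-capturing group

- `(?=str)`: positive lookahead

- `(?!str)`: negative lookahead

- `(?<=str)`: positive lookbehind

- `(?<!str)`: negative lookbehind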
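
A minimal sketch that exercises each construct in the list above. The sample text and the currency codes are hypothetical; the `\b` word boundaries are added so that backtracking cannot return part of a number.

```python
import re

# Hypothetical sample text, for illustration only.
text = "100 USD, 200 EUR, USD 300"

# (?:...) groups without capturing, so findall returns only the (\d+) group.
non_capturing = re.findall(r"(\d+) (?:USD|EUR)", text)

# (?=...) positive lookahead: numbers followed by " USD".
followed_by = re.findall(r"\d+(?= USD)", text)

# (?!...) negative lookahead: whole numbers NOT followed by " USD".
not_followed_by = re.findall(r"\b\d+\b(?! USD)", text)

# (?<=...) positive lookbehind: numbers preceded by "USD ".
preceded_by = re.findall(r"(?<=USD )\d+", text)

# (?<!...) negative lookbehind: whole numbers NOT preceded by "USD ".
not_preceded_by = re.findall(r"(?<!USD )\b\d+", text)

print(non_capturing)    # ['100', '200']
print(followed_by)      # ['100']
print(not_followed_by)  # ['200', '300']
print(preceded_by)      # ['300']
print(not_preceded_by)  # ['100', '200']
```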

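### Matching an optional word (one that may or may not appear)

- For example, to match both spellings "color" and "colour", write the regex `colou?r`; here `u?` makes the `u` optional.
- If instead a whole word should be optional, e.g. "color", wrap it in a group: `(color)?`.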
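
Both cases can be sketched as follows. The phrase "color picker" is a made-up example, and the group is written as non-capturing (`(?:...)`) only because the captured text is not needed here.

```python
import re

# "u?" makes only the single character "u" optional.
spelling = re.compile(r"colou?r")
results = {w: bool(spelling.fullmatch(w)) for w in ["color", "colour", "colr"]}
print(results)  # {'color': True, 'colour': True, 'colr': False}

# "(?:color )?" makes the whole word (plus its trailing space) optional.
phrase = re.compile(r"(?:color )?picker")
with_word = bool(phrase.fullmatch("color picker"))
without_word = bool(phrase.fullmatch("picker"))
other_word = bool(phrase.fullmatch("colour picker"))
print(with_word, without_word, other_word)  # True True False
```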

### Ignoring case

- https://stackoverflow.com/questions/500864/case-insensitive-regular-expression-without-re-compile

```
re.findall('(?i)xx', s)
re.findall('xx', s, re.IGNORECASE)
```
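
A quick check that the two forms above behave the same, plus a compiled variant. The sample string is made up for illustration.

```python
import re

# Hypothetical sample string containing "xx" in several casings.
s = "Xx marks the spot; xX and XX too"

inline = re.findall(r"(?i)xx", s)              # inline flag at the start of the pattern
flagged = re.findall(r"xx", s, re.IGNORECASE)  # flag passed as an argument
compiled = re.compile(r"xx", re.IGNORECASE).findall(s)

print(inline)                         # ['Xx', 'xX', 'XX']
print(inline == flagged == compiled)  # True
```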

### Collapsing multiple spaces into one

In [4]:
str1 = '  rwe fdsa    fasf   '
re.sub(' +', ' ', str1)

' rwe fdsa fasf '
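
The result above still has a leading and a trailing space, and `' +'` only handles the space character. A hedged sketch of two common variants, assuming that tabs and newlines should be collapsed too; the second sample string is made up.

```python
import re

str1 = '  rwe fdsa    fasf   '

# Collapse runs of spaces, then strip the ends.
collapsed = re.sub(r' +', ' ', str1).strip()

# Without regex: split on any whitespace and rejoin with single spaces.
joined = ' '.join(str1.split())

# \s+ also collapses tabs and newlines (hypothetical sample string).
mixed = re.sub(r'\s+', ' ', "a\t\tb  \n c")

print(repr(collapsed))  # 'rwe fdsa fasf'
print(repr(joined))     # 'rwe fdsa fasf'
print(repr(mixed))      # 'a b c'
```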