**Python's standard library for regular expressions**

@creation time: 2023-08-29  
@follow: 
1. [re-python docs](https://docs.python.org/zh-cn/3/library/re.html)
2. [r2coding](https://r2coding.com/#/README?id=正则表达式)
3. [regex101.com, a regular expression practice site](https://regex101.com)
4. https://www.cnblogs.com/CYHISTW/p/11363209.html

In [1]:
import re

## Introduction

* A regular expression is a pattern that describes a set of strings. It is commonly used to find and match text.

---
## Escape characters

In [2]:
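# Python string literals use the backslash \ as the escape character
print("ABC\\-001")
print("ABC\-001")  # "\-" is not a valid escape: the backslash is kept, but Python emits an invalid-escape-sequence warning

# Regular expressions also use \ for escaping.
# Always write patterns as raw strings (r"...") so that Python leaves the backslashes untouched
print(r"ABC\\-001")
print(r"ABC\-001")

ABC\-001
ABC\-001
ABC\\-001
ABC\-001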
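
The difference between normal and raw string literals matters most for escapes that both Python and the regex engine understand. The sketch below uses `\b` as an illustration (the sample text is made up for this example) and also shows `re.escape()`, which escapes metacharacters in literal text.

```python
import re

# In a normal string literal "\b" is one backspace character (ASCII 8).
# In a raw string it stays as backslash + "b", which the regex engine reads as a word boundary.
normal = "\bcat\b"
raw = r"\bcat\b"

print(len(normal), len(raw))              # 5 7
print(re.findall(normal, "cat concat"))   # [] (looks for backspace characters)
print(re.findall(raw, "cat concat"))      # ['cat']

# re.escape() backslash-escapes regex metacharacters such as "." in literal text.
literal = re.escape("www.python.org")
print(re.findall(literal, "www.python.org wwwxpythonxorg"))  # ['www.python.org']
```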


---
# Searching and matching

## re.match()
Matches the pattern at the start of the string only. If the pattern does not match at position 0, it returns None.

In [3]:
string = "www.python.org"
regex = r"www"

print(re.match(pattern=regex, string=string))

<re.Match object; span=(0, 3), match='www'>
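
As a small illustration of the anchoring behavior, the sketch below contrasts `re.match()` with `re.fullmatch()`, which additionally requires the pattern to cover the whole string. The variable names are chosen for this example only.

```python
import re

text = "www.python.org"

not_at_start = re.match(r"python", text)       # None: "python" does not begin at index 0
at_start = re.match(r"www", text)              # matches the first three characters
partial = re.fullmatch(r"www", text)           # None: the pattern must cover the whole string
whole = re.fullmatch(r"www\.\w+\.org", text)   # matches the entire string

print(not_at_start, at_start.group(), partial, whole.group())
```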


## re.search()
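Scans the whole string and returns a Match object for the first position where the pattern matches, or None if there is no match.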

In [4]:
string = "www.python.org"
regex = r"o"

searchObj = re.search(pattern=regex, string=string)
searchObj.group()

'o'
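
Because `re.search()` returns None when nothing matches, calling `.group()` directly on the result can fail. A minimal sketch of the usual guard, which also reads the match position:

```python
import re

text = "www.python.org"

m = re.search(r"o", text)
if m is not None:
    # The first "o" is the one inside "python"
    print(m.group(), m.span())   # o (8, 9)

missing = re.search(r"z", text)
print(missing)                   # None, so missing.group() would raise AttributeError
```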

## re.findall()
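* Finds all non-overlapping substrings matched by the pattern and returns them as a list
* If the pattern contains one capturing group, the list holds that group's text; with several capturing groups, it holds tuples
* If nothing matches, it returns an empty list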

In [5]:
string = "www.python.org"
regex = r"(python|org)"

re.findall(pattern=regex, string=string)

['python', 'org']
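
The return shape of `re.findall()` depends on the number of capturing groups. The sketch below walks through the cases on the same sample string; the patterns are illustrative.

```python
import re

text = "www.python.org"

no_group = re.findall(r"\w+", text)          # whole matches
one_group = re.findall(r"\.(\w+)", text)     # only the text of the single group
two_groups = re.findall(r"(\w)(\w+)", text)  # one tuple per match
no_match = re.findall(r"\d+", text)          # empty list, not an empty string

print(no_group)    # ['www', 'python', 'org']
print(one_group)   # ['python', 'org']
print(two_groups)  # [('w', 'ww'), ('p', 'ython'), ('o', 'rg')]
print(no_match)    # []
```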

## re.finditer()
Similar to findall(), except that it returns an iterator of Match objects

In [6]:
string = "www.python.org"
regex = r"(python|org)"

for item in re.finditer(pattern=regex, string=string):
    print(item.group())

python
org
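
Since `re.finditer()` yields Match objects rather than strings, each result also carries its position. A short sketch that collects the matched text together with its span:

```python
import re

text = "www.python.org"

# Each item is a Match object, so group() and span() are both available
positions = [(m.group(1), m.span()) for m in re.finditer(r"(python|org)", text)]
print(positions)  # [('python', (4, 10)), ('org', (11, 14))]
```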


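## re.compile()
Compiles a regular expression once so that it can be reused; the resulting Pattern object offers the same matching methods as the re module functions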

In [7]:
string = "one12twothree34four"
pattern = re.compile(r'\d+')  # compile the regular expression into an re.Pattern object

In [8]:
# Pattern.match() still anchors the match at its starting point, but it accepts a pos argument that moves that starting point
print(pattern.match(string))
print(pattern.match(string, pos=1))
print(pattern.match(string, pos=4))

None
None
<re.Match object; span=(4, 5), match='2'>
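
The other Pattern methods accept `pos` (and `endpos`) as well. The following sketch reuses the same sample string to show `search()` and `findall()` on a compiled pattern.

```python
import re

pattern = re.compile(r"\d+")
text = "one12twothree34four"

first = pattern.search(text).group()            # first run of digits anywhere in the string
all_numbers = pattern.findall(text)             # every run of digits
after_five = pattern.findall(text, 5)           # start scanning at index 5
clipped = pattern.search(text, pos=0, endpos=4).group()  # only text[0:4] == "one1" is considered

print(first, all_numbers, after_five, clipped)  # 12 ['12', '34'] ['34'] 1
```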


## re.split()
Splits the string at every match of the pattern and returns the pieces as a list

In [9]:
string = "www.python.org"
regex = r"\.p|\.o"

re.split(pattern=regex, string=string)

['www', 'ython', 'rg']

In [10]:
string = "www.python.org"
regex = r"(\.p|\.o)"  # with a capturing group, the matched separators are kept in the result

re.split(pattern=regex, string=string)

['www', '.p', 'ython', '.o', 'rg']
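
Two further uses of `re.split()` that often come up are limiting the number of splits with `maxsplit` and splitting on several separators at once. A brief sketch with made-up input:

```python
import re

# maxsplit=1 stops after the first separator
limited = re.split(r"\.", "www.python.org", maxsplit=1)
print(limited)  # ['www', 'python.org']

# A character class lets one pattern cover several separators plus trailing spaces
fields = re.split(r"[,;]\s*", "a, b;c,  d")
print(fields)   # ['a', 'b', 'c', 'd']
```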

---
# Substitution

## re.sub()
re.sub(pattern=<regular expression>, repl=<replacement string or function>, string=<string to search>, count=<maximum number of replacements; 0 replaces all matches>)

In [11]:
phone = "2004-959-559 # this is a foreign phone number"

# Remove the Python-style comment from the string
regex = r"\s#.*$"
re.sub(pattern=regex, repl="", string=phone)  # here the replacement is the empty string

'2004-959-559'

In [12]:
# Passing a function as repl lets you transform each match before it is substituted
def func(matched):
    value = int(matched.group("value"))
    return str(value * 2)

string = "A123B234"
re.sub(pattern=r"(?P<value>\d+)", repl=func, string=string)

'A246B468'
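
The `count` parameter described above, and the related `re.subn()` function that also reports how many replacements were made, can be sketched as follows. The input string is invented for illustration.

```python
import re

text = "A123B234C345"

# count=2 replaces only the first two matches
first_two = re.sub(r"\d+", "#", text, count=2)
print(first_two)  # A#B#C345

# subn() returns a (new_string, number_of_replacements) tuple
replaced, n = re.subn(r"\d+", "#", text)
print(replaced, n)  # A#B#C# 3
```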

---
# Regular expression exercises

@follow: 《正则表达式必知必会》 (by Ben Forta)

In [13]:
import re

sec 2.1.2 Regular expressions are case-sensitive

In [14]:
string = "abc ABC cba"
regex = "abc"
re.findall(regex, string)

['abc']
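
When case should be ignored, Python offers the `re.IGNORECASE` flag or the inline `(?i)` marker. A minimal sketch on the same input:

```python
import re

string = "abc ABC cba"

with_flag = re.findall(r"abc", string, flags=re.IGNORECASE)
inline = re.findall(r"(?i)abc", string)  # same effect, written inside the pattern

print(with_flag)  # ['abc', 'ABC']
print(inline)     # ['abc', 'ABC']
```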

sec 2.3 The . character does not match a newline (by default)

In [15]:
string = "\t123"
regex = ".1"
re.findall(regex, string)

['\t1']

In [16]:
string = "\n123"
regex = ".1"
re.findall(regex, string)

[]
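
This default can be changed with the `re.DOTALL` flag (or the inline `(?s)` marker), which makes `.` match newlines too. A short sketch on the same input:

```python
import re

string = "\n123"

default = re.findall(r".1", string)                  # "." skips the newline
dotall = re.findall(r".1", string, flags=re.DOTALL)  # "." now matches "\n"
inline = re.findall(r"(?s).1", string)               # inline form of the same flag

print(default, dotall, inline)  # [] ['\n1'] ['\n1']
```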

sec 8.2 Using backreferences

In [17]:
content = "This is a block of of text, several words here are are repeated, and and they should not be."
print("Text: \n", content)

re.findall(pattern=r"\s+(\w+)\s+(\1)", string=content)  # ! Note: this differs from the book's pattern

Text: 
 This is a block of of text, several words here are are repeated, and and they should not be.


[('of', 'of'), ('are', 'are'), ('and', 'and')]
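
A variant of the same idea, sketched here as an assumption rather than the book's version, uses a named group with `(?P=name)` as the backreference and `\b` word boundaries so that partial words such as the "is" inside "This" are not counted. The group name `word` is made up for this example.

```python
import re

content = "This is a block of of text, several words here are are repeated, and and they should not be."

# (?P<word>...) names the group; (?P=word) refers back to whatever it matched
pattern = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

repeated = [m.group("word") for m in pattern.finditer(content)]
print(repeated)  # ['of', 'are', 'and']

# Replacing each doubled word with a single copy removes the repetition
cleaned = pattern.sub(r"\g<word>", content)
print(cleaned)
```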

sec 8.3 Using backreferences in substitutions

In [18]:
"""
Reformat phone numbers
"""
content = "313-555-1234\n248-555-9999\n810-555-9000"
print(f"Text: \n{content}")

content_sub = re.sub(pattern=r"(\d{3})-(\d{3})-(\d{4})", repl=r"(\1) \2-\3", string=content)
print(f"Text after substitution: \n{content_sub}")

Text: 
313-555-1234
248-555-9999
810-555-9000
Text after substitution: 
(313) 555-1234
(248) 555-9999
(810) 555-9000
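
In replacement strings, groups can also be referenced with `\g<name>` or `\g<number>`. The sketch below rewrites the phone example with named groups (the names `area`, `prefix`, and `line` are invented here) and shows why `\g<1>` is needed when a digit follows the reference.

```python
import re

pattern = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")

formatted = pattern.sub(r"(\g<area>) \g<prefix>-\g<line>", "313-555-1234")
print(formatted)  # (313) 555-1234

# r"\10" would mean group 10; r"\g<1>0" means group 1 followed by a literal "0"
times_ten = re.sub(r"(\d)", r"\g<1>0", "5")
print(times_ten)  # 50
```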


In [19]:
"""
Uppercase the content of the HTML h1 heading
"""
content = "<body>\n<h1>Welcome to my Homepage</h1>\nContent is divided into two sections:<br/> <h2>SQL</h2>\nInformation about SQL."
print(f"Text: \n{content}")

def upper_repl(match):
    # ! Python's re does not support the replacement syntax used in the book, so a function is used instead
    return match.group(1) + match.group(2).upper() + match.group(3)

content_sub = re.sub(pattern=r"(<[Hh]1>)(.*)(<\/[Hh]1>)",
                     repl=upper_repl, string=content)
print(f"\nText after substitution: \n{content_sub}")

Text: 
<body>
<h1>Welcome to my Homepage</h1>
Content is divided into two sections:<br/> <h2>SQL</h2>
Information about SQL.

Text after substitution: 
<body>
<h1>WELCOME TO MY HOMEPAGE</h1>
Content is divided into two sections:<br/> <h2>SQL</h2>
Information about SQL.


sec 9.2 Lookaround: lookahead

In [20]:
content = "http://www.forta.com/\nhttps://mail.forta.com/\nftp://ftp.forta.com/"
print(f"Text: \n{content}")

re.findall(pattern=r"\w+(?=:)", string=content)

Text: 
http://www.forta.com/
https://mail.forta.com/
ftp://ftp.forta.com/


['http', 'https', 'ftp']

sec 9.3 Lookaround: lookbehind

In [21]:
content = "ABC01: $23.45\nHGG42: $5.31\nCFMX1: $899.00\nXTC99: $69.96\nTotal items found: 4"
print(f"Text: \n{content}")
# Match the amount that follows each $
re.findall(pattern=r"(?<=\$)[0-9.]+", string=content)

Text: 
ABC01: $23.45
HGG42: $5.31
CFMX1: $899.00
XTC99: $69.96
Total items found: 4


['23.45', '5.31', '899.00', '69.96']

sec 9.4 Combining lookahead and lookbehind

In [22]:
content = "<head>\n<title>Ben Forta's Homepage</title>\n</head>"
print(f"Text: \n{content}")

re.findall(pattern=r"(?<=<title>).*(?=<\/title>)", string=content)

Text: 
<head>
<title>Ben Forta's Homepage</title>
</head>


["Ben Forta's Homepage"]
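
Lookarounds assert context without consuming it, so the tags stay out of the match itself. For comparison, the sketch below gets the same text with a plain capturing group, where the tags are part of the overall match and only `group(1)` holds the title.

```python
import re

content = "<head>\n<title>Ben Forta's Homepage</title>\n</head>"

with_lookaround = re.search(r"(?<=<title>).*(?=</title>)", content)
with_group = re.search(r"<title>(.*)</title>", content)

print(with_lookaround.group())  # Ben Forta's Homepage
print(with_group.group())       # <title>Ben Forta's Homepage</title>
print(with_group.group(1))      # Ben Forta's Homepage
```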

sec 9.5 Negative lookaround

In [23]:
content = "I paid $30 for 100 apples, 50 oranges, and 60 pears.\nI saved $5 on this order."
print(f"Text: \n{content}")

# Match the quantities (numbers not preceded by $)
re.findall(pattern=r"\b(?<!\$)\d+\b", string=content)

Text: 
I paid $30 for 100 apples, 50 oranges, and 60 pears.
I saved $5 on this order.


['100', '50', '60']
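
Negative lookahead, written `(?!...)`, works the same way in the other direction. As an illustrative extension that is not taken from the book, the sketch below additionally excludes the quantity that is followed by " apples".

```python
import re

content = "I paid $30 for 100 apples, 50 oranges, and 60 pears."

prices = re.findall(r"(?<=\$)\d+", content)            # positive lookbehind: preceded by $
quantities = re.findall(r"\b(?<!\$)\d+\b", content)    # negative lookbehind: not preceded by $
not_apples = re.findall(r"\b(?<!\$)\d+\b(?! apples)", content)  # and not followed by " apples"

print(prices)      # ['30']
print(quantities)  # ['100', '50', '60']
print(not_apples)  # ['50', '60']
```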

sec 10.2.1 Backreference conditions

In [53]:
'''
Backreference condition with only an if branch
'''
content = '<!-- Nav bar -->\n\
<div>\n\
<a href="/home"><img src="/images/home.gif"></a>\n\
<img src="/images/spacer.gif">\n\
<a href="/search"><img src="/images/search.gif"></a>\n\
<img src="/images/spacer.gif">\n\
<a href="/help"><img src="/images/help.gif"></a>\n\
</div>'

print(f"Text: \n{content}")


# Match every <img> tag, together with the surrounding <a> and </a> when the image is a link
print("\n==> result:")
for item in re.finditer(pattern=r"(<a\s[^>]+>)?<img\s[^>]+>(?(1)<\/a>)", string=content):
    print(item.group())

Text: 
<!-- Nav bar -->
<div>
<a href="/home"><img src="/images/home.gif"></a>
<img src="/images/spacer.gif">
<a href="/search"><img src="/images/search.gif"></a>
<img src="/images/spacer.gif">
<a href="/help"><img src="/images/help.gif"></a>
</div>

==> result:
<a href="/home"><img src="/images/home.gif"></a>
<img src="/images/spacer.gif">
<a href="/search"><img src="/images/search.gif"></a>
<img src="/images/spacer.gif">
<a href="/help"><img src="/images/help.gif"></a>


In [68]:
"""
Backreference condition with if and else branches
"""
content = "123-456-7890\n\
(123)456-7890\n\
(123)-456-7890\n\
(123-456-7890\n\
1234567890\n\
123 456 7890"
print(f"Text: \n{content}")

# Match only the valid phone number formats (123)456-7890 or 123-456-7890
print("\n==> result:")
for line in content.split("\n"):
    if re.match(pattern=r"(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}", string=line):
        print(line)

Text: 
123-456-7890
(123)456-7890
(123)-456-7890
(123-456-7890
1234567890
123 456 7890

==> result:
123-456-7890
(123)456-7890


sec 10.2.2 Conditions based on lookaround

In [132]:
content = "11111\n22222\n33333-\n44444-4444"
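print(f"Text: \n{content}")

# Match numbers that are either 5 digits, or 5+4 digits joined by a hyphen
print("\n==> result:")
try:
    for line in content.split("\n"):
        if re.match(pattern=r"\d{5}(?(?=-)-\d{4})", string=line):
            print(line)
except re.error:
    # re only accepts a group number or name as the condition, so this pattern fails to compile
    print("Python's re does not support lookaround conditions")

# An alternation can be used instead
re.findall(pattern=r"\d{5}(?=[^-])|\d{5}-\d{4}", string=content)

Text: 
11111
22222
33333-
44444-4444

==> result:
Python's re does not support lookaround conditions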


['11111', '22222', '44444-4444']
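
Another way to express the same rule without a lookaround condition, offered here as one possible sketch, is an optional non-capturing group checked line by line with `fullmatch()`. Unlike `(?=[^-])`, it does not need a character to follow the five digits. The names `zip_pattern` and `supported` are chosen for this example.

```python
import re

lines = "11111\n22222\n33333-\n44444-4444".split("\n")

# Five digits, optionally followed by a hyphen and four more digits
zip_pattern = re.compile(r"\d{5}(?:-\d{4})?")
valid = [line for line in lines if zip_pattern.fullmatch(line)]
print(valid)  # ['11111', '22222', '44444-4444']

# The lookaround-condition form is rejected when the pattern is compiled
try:
    re.compile(r"\d{5}(?(?=-)-\d{4})")
    supported = True
except re.error:
    supported = False
print(supported)  # False
```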