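&emsp;&emsp;After sending a request to the target site and successfully retrieving the page, the next step is to parse it and extract the data you want. Regular expressions are the most basic way to do this; in Python, parsing libraries such as lxml, BeautifulSoup, and pyquery serve the same purpose.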

## 3.1 Regular Expressions

In [1]:
import re

### 3.1.1 match()

In [8]:
content = 'Hello 123 4567 World_This is a Regex Demo'

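&emsp;&emsp;Simple example: `re.match()` tries to match the pattern from the start of the string and returns a match object, or `None` if it fails.

In [22]:
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}', content)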

print(result)
print(result.group())
print(result.span())

<re.Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This
(0, 25)
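
&emsp;&emsp;Because `re.match()` returns `None` when the pattern does not match, calling `.group()` directly on the result can raise `AttributeError`. The following is a minimal sketch of a guard; the helper name `first_match` is hypothetical and not part of the `re` module.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'


def first_match(pattern, text):
    """Return the matched text, or None if the pattern fails at the start of text.

    Hypothetical helper for illustration only.
    """
    result = re.match(pattern, text)
    # re.match() returns None on failure, so check before calling .group()
    return result.group() if result else None


matched = first_match(r'^Hello\s\d{3}\s\d{4}\s\w{10}', content)
missed = first_match(r'^World', content)

print(matched)
print(missed)
```

&emsp;&emsp;The second call fails because `World` does not appear at the start of the string, which is the only position `match()` tries.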


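&emsp;&emsp;Capturing a target: parentheses mark a group, which `group(1)` returns; `group()` and `group(0)` both return the whole match.

In [19]:
result = re.match(r'^Hello\s(\d+)\s\d{4}\sWorld', content)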

print(result)
print(result.group())
print(result.group(0))
print(result.group(1))
print(result.span())

<re.Match object; span=(0, 20), match='Hello 123 4567 World'>
Hello 123 4567 World
Hello 123 4567 World
123
(0, 20)
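
&emsp;&emsp;When a pattern has more than one group, `groups()` returns all captures at once, and a named group written as `(?P<name>...)` can be read by name. This sketch extends the pattern above with a second, named group; the group name `second` is an assumption chosen for illustration.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# Group 1 is positional; group 2 is also reachable by the name "second"
result = re.match(r'^Hello\s(\d+)\s(?P<second>\d{4})\sWorld', content)

all_groups = result.groups()      # every captured group, as a tuple
second = result.group('second')   # lookup by group name
first_span = result.span(1)       # start and end index of group 1 only

print(all_groups)
print(second)
print(first_span)
```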


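&emsp;&emsp;Generic matching: `.*` matches any run of characters (excluding newlines by default), so the middle of the pattern does not have to be spelled out.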

In [21]:
result = re.match('^Hello.*Demo$', content)

print(result)
print(result.group())
print(result.span())

<re.Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Hello 123 4567 World_This is a Regex Demo
(0, 41)


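&emsp;&emsp;Greedy matching: `.*` consumes as many characters as possible, so `(\d+)` is left with only the last digit.

In [23]:
result = re.match(r'^Hello.*(\d+).*Demo$', content)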

print(result)
print(result.group(1))

<re.Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
7


&emsp;&emsp;Non-greedy matching: `.*?` consumes as few characters as possible, so `(\d+)` captures the full `123`.

In [24]:
result = re.match(r'^Hello.*?(\d+).*Demo$', content)

print(result)
print(result.group(1))

<re.Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
123
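
&emsp;&emsp;To see where the characters go in the greedy and non-greedy cases, it can help to capture the wildcard itself. This sketch wraps `.*` and `.*?` in an extra group; the variable names are illustrative only.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# Group 1 shows what the wildcard consumed; group 2 is the digit capture
greedy = re.match(r'^Hello(.*)(\d+).*Demo$', content)
lazy = re.match(r'^Hello(.*?)(\d+).*Demo$', content)

print(greedy.groups())
print(lazy.groups())
```

&emsp;&emsp;In the greedy case the wildcard swallows ` 123 456` and gives back just enough for `\d+` to match a single digit. In the non-greedy case it stops after the first space.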


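&emsp;&emsp;Modifiers (flags): by default `.` does not match a newline; passing `re.S` lets it match newlines as well.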

In [37]:
content = '''Hello 123 4567 World_This 
is a Regex Demo
'''

result = re.match(r'^Hello.*?(\d+).*?Demo$', content, re.S)

print(result)
print(result.group(1))

<re.Match object; span=(0, 42), match='Hello 123 4567 World_This \nis a Regex Demo'>
123
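
&emsp;&emsp;The effect of the flag is easiest to see by running the same pattern with and without it. The sketch below also combines `re.S` with `re.I` (case-insensitive matching) using `|`; the lowercase pattern is an illustrative variation, not taken from the example above.

```python
import re

content = 'Hello 123 4567 World_This \nis a Regex Demo\n'
pattern = r'^Hello.*?(\d+).*?Demo$'

# Without re.S, "." stops at the newline, so "Demo" is never reached
no_flag = re.match(pattern, content)

# With re.S, "." also matches "\n"
with_s = re.match(pattern, content, re.S)

# Flags can be combined with the bitwise OR operator
combined = re.match(r'^hello.*?(\d+).*?demo$', content, re.S | re.I)

print(no_flag)
print(with_s.group(1))
print(combined.span())
```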


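&emsp;&emsp;Matching escaped characters: prefix regex metacharacters such as `(`, `)`, and `.` with a backslash to match them literally.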

In [45]:
content = '(gayhub)www.github.com'

result = re.match(r'\(gayhub\)www\.github\.com', content)

print(result)
print(result.group())

<re.Match object; span=(0, 22), match='(gayhub)www.github.com'>
(gayhub)www.github.com
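
&emsp;&emsp;Escaping by hand is error-prone for longer strings. The standard library's `re.escape()` adds the backslashes automatically. The sketch below uses a hypothetical sample string to show what goes wrong without escaping.

```python
import re

text = '(docs)www.python.org'  # hypothetical sample string

# Unescaped: "(docs)" is read as a capture group, so the literal "(" never matches
unescaped = re.match('(docs)www.python.org', text)

# re.escape() backslash-escapes the metacharacters for us
escaped = re.match(re.escape(text), text)

# An unescaped "." matches any character, which can cause false positives
loose = re.match(r'www.python.org', 'wwwXpython.org')
strict = re.match(r'www\.python\.org', 'wwwXpython.org')

print(unescaped)
print(escaped.group())
print(loose is not None, strict is not None)
```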


### 3.1.2 search()

In [62]:
content = 'Extra_String Hello 123 4567 World_This is a Regex Demo Extra_String'

In [63]:
# match() only tries the start of the string, so the leading "Extra_String" makes it fail
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)

None


In [64]:
# search() scans the whole string and returns the first match it finds
result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result)

<re.Match object; span=(13, 54), match='Hello 123 4567 World_This is a Regex Demo'>
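
&emsp;&emsp;The difference between the two calls comes down to where matching may begin. As a rough sketch, the same compiled pattern can be tried with `match()`, with `search()`, and with `match()` given an explicit start position (the optional `pos` argument of compiled patterns).

```python
import re

content = 'Extra_String Hello 123 4567 World_This is a Regex Demo Extra_String'
pattern = re.compile(r'Hello.*?(\d+).*?Demo')

anchored = pattern.match(content)       # only tries index 0
scanned = pattern.search(content)       # tries every position in turn
offset = pattern.match(content, 13)     # match(), but starting at index 13

print(anchored)
print(scanned.start(), scanned.group(1))
print(offset.span())
```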


``` html
<li>
    <a class="package-snippet" href="/project/isocor/">
        <h3 class="package-snippet__title">
            <span class="package-snippet__name">isocor</span>
            <span class="package-snippet__version">2.0.4</span>
        </h3>
        <p class="package-snippet__description">IsoCor: Isotope Correction for mass spectrometry labeling experiments</p>
    </a>
</li>
```
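
&emsp;&emsp;As a minimal sketch of how `search()` might be applied to a fragment like this one, the pattern below pulls out the link, package name, and version. The group names `href`, `name`, and `version` are assumptions chosen for illustration, and `re.S` is needed because the fragment spans several lines.

```python
import re

html = '''<li>
    <a class="package-snippet" href="/project/isocor/">
        <h3 class="package-snippet__title">
            <span class="package-snippet__name">isocor</span>
            <span class="package-snippet__version">2.0.4</span>
        </h3>
        <p class="package-snippet__description">IsoCor: Isotope Correction for mass spectrometry labeling experiments</p>
    </a>
</li>'''

# Non-greedy wildcards keep each capture as short as possible
pattern = re.compile(
    r'href="(?P<href>.*?)".*?'
    r'package-snippet__name">(?P<name>.*?)</span>.*?'
    r'package-snippet__version">(?P<version>.*?)</span>',
    re.S,
)

result = pattern.search(html)
package = result.groupdict() if result else {}

print(package)
```

&emsp;&emsp;A pattern like this depends on the exact attribute order and class names in the markup, so it may break when the page layout changes.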