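# How to Fetch Data from the Web (Web Scraping)
Scraping has two steps: fetch the web page, then parse its content.
## Fetching
- urllib, a built-in module
- Requests, a third-party library
- Scrapy, a framework
## Parsing
- BeautifulSoup library
- re module
## Fetching and parsing through a third-party API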

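## The Requests library
Requests is a third-party Python HTTP library designed to be simple, convenient, and user-friendly.
Chinese documentation: https://cn.python-requests.org/zh_CN/latest/
Basic method: requests.get() requests the resource at the given URL and corresponds to the HTTP GET method.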

In [None]:
# Page to fetch: https://book.douban.com/subject/1084336/comments/
# Check the site's crawler policy (robots.txt): https://book.douban.com/robots.txt
# Note: only crawl pages that the site allows crawlers to access
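The robots.txt check mentioned above can also be done in code. The following is a minimal sketch using the standard-library `urllib.robotparser`; the helper name `is_allowed`, the user-agent string `my-crawler`, and the sample rules are hypothetical and are not the real rules of any site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (not copied from a real robots.txt)
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "my-crawler", "https://example.com/public/page"))
print(is_allowed(sample_rules, "my-crawler", "https://example.com/private/page"))
```

Against a live site, one would typically call `set_url()` with the robots.txt address and then `read()` instead of passing the lines by hand.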

In [14]:
import requests
from bs4 import BeautifulSoup

In [4]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
url = 'https://book.douban.com/subject/1084336/comments/'
r = requests.get(url, headers=headers)

In [5]:
# A 418 status code indicates that the site has an anti-scraping mechanism and has rejected the request
# https://blog.csdn.net/weixin_40807714/article/details/109579279
r.status_code

200
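Checking the status code before using the response can be wrapped in a small helper. This is only a sketch: the function name `fetch_html`, the 10-second default timeout, and the choice to return `None` on 418 are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Return the page text, or None if the site answers with 418."""
    response = requests.get(url, headers=headers, timeout=timeout)
    if response.status_code == 418:
        # Assumption: 418 means the site rejected the request as a bot
        return None
    # Raises requests.HTTPError for any other 4xx or 5xx status
    response.raise_for_status()
    return response.text
```

A call such as `fetch_html(url, headers=headers)` could then stand in for the separate `requests.get` and `r.status_code` steps.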

In [12]:
r.headers['content-type']

'text/html; charset=utf-8'

In [6]:
r.text

'\n\n<!DOCTYPE html>\n<html lang="zh-cmn-Hans" class="ua-windows ua-webkit book-new-nav">\n<head>\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n  <title>\n    小王子 短评\n</title>\n  \n<script>!function(e){var o=function(o,n,t){var c,i,r=new Date;n=n||30,t=t||"/",r.setTime(r.getTime()+24*n*60*60*1e3),c="; expires="+r.toGMTString();for(i in o)e.cookie=i+"="+o[i]+c+"; path="+t},n=function(o){var n,t,c,i=o+"=",r=e.cookie.split(";");for(t=0,c=r.length;t<c;t++)if(n=r[t].replace(/^\\s+|\\s+$/g,""),0==n.indexOf(i))return n.substring(i.length,n.length).replace(/\\"/g,"");return null},t=e.write,c={"douban.com":1,"douban.fm":1,"google.com":1,"google.cn":1,"googleapis.com":1,"gmaptiles.co.kr":1,"gstatic.com":1,"gstatic.cn":1,"google-analytics.com":1,"googleadservices.com":1},i=function(e,o){var n=new Image;n.onload=function(){},n.src="https://www.douban.com/j/except_report?kind=ra022&reason="+encodeURIComponent(e)+"&environment="+encodeURIComponent(o)},r=function(o){try{t.ca

In [None]:
# Parse the data with BeautifulSoup
# BeautifulSoup is a Python library for extracting data from HTML or XML files
# Chinese documentation: https://beautifulsoup.cn/
# <span class="short">十几岁的时候渴慕着小王子，一天之间可以看四十四次日落。是在多久之后才明白，看四十四次日落的小王子，他有多么难过。</span>
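As a minimal sketch of how such a `<span class="short">` element might be extracted, the snippet below runs BeautifulSoup on a small hand-written HTML fragment. The surrounding `<div>` and the second `<span>` are made up for illustration, and the built-in `html.parser` backend is used here so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; only the first span mirrors the page
html = (
    '<div>'
    '<span class="short">十几岁的时候渴慕着小王子，一天之间可以看四十四次日落。</span>'
    '<span class="other">not a comment</span>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute
comments = [span.get_text(strip=True) for span in soup.find_all("span", class_="short")]

for comment in comments:
    print(comment)
```

On the fetched page, `r.text` would take the place of the `html` string, assuming the comment markup matches the sample element.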

In [None]:
# Use the re module for regular-expression processing
# Official documentation: https://docs.python.org/zh-cn/3.9/library/re.html
# <span class="user-stars allstar50 rating" title="力荐"></span>
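A rough sketch of pulling the rating out of a `user-stars` element with `re` follows. The pattern is written against the single sample element above, and the interpretation that `allstar50` means 5 stars (the number divided by 10) is an assumption; `extract_ratings` is a hypothetical helper name.

```python
import re

# Captures the digits after "allstar" and the text of the title attribute
RATING_PATTERN = re.compile(
    r'<span class="user-stars allstar(\d+) rating" title="([^"]*)">'
)


def extract_ratings(html):
    """Return a list of (stars, title) tuples found in the HTML text."""
    # Assumption: allstar50 corresponds to 5 stars, allstar40 to 4, and so on
    return [(int(score) // 10, title) for score, title in RATING_PATTERN.findall(html)]


sample = '<span class="user-stars allstar50 rating" title="力荐"></span>'
print(extract_ratings(sample))  # [(5, '力荐')]
```

Regular expressions are brittle against markup changes, so this kind of pattern suits small, fixed fragments such as the class name here.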

In [15]:
markup = '<p class="title"><b>The Little Prince</b></p>'

In [16]:
soup = BeautifulSoup(markup, "lxml")

In [17]:
soup.b

<b>The Little Prince</b>

In [18]:
type(soup.b)

bs4.element.Tag

In [19]:
tag = soup.p

In [20]:
tag.name

'p'

In [21]:
tag.attrs

{'class': ['title']}

In [22]:
tag['class']

['title']

In [23]:
tag.string

'The Little Prince'

In [24]:
type(tag.string)

bs4.element.NavigableString

In [25]:
soup.find_all('b')

[<b>The Little Prince</b>]