# Douban Scraper: A Step-by-Step Walkthrough

## Fetching a web page with Python's requests library and extracting the returned data

The following cell shows how to fetch the HTML of https://movie.douban.com/subject/11026735/

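Tip: how do you quickly get the corresponding Python code? Copy the request from the browser, then convert it into code, as the two screenshots show.

![Copy the browser request](resources/douban-1.png)
![Convert it to code](resources/douban-2.png)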

In [None]:
import requests

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://movie.douban.com/subject/11026735/comments?start=200&limit=20&status=P&sort=new_score',
    'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
}

response = requests.get('https://movie.douban.com/subject/11026735/', headers=headers)

Show the first 500 characters of the response:

In [None]:
response.text[0:500]

## Extracting the movie title with a regular expression

Test the regular expression at: https://regex101.com/r/AFRdGf/2

In [None]:
import re

# Named groups capture the name, URL and image fields of each match
regex = r"\"name\":\s\"(?P<name>.*?)\",\s*\"url\":\s\"(?P<url>.*?)\",\s*\"image\":\s\"(?P<image>.*?)\""
re.findall(regex, response.text)
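
With several capture groups, `re.findall` returns a list of tuples and drops the group names. The sketch below runs the same pattern against a small hand-written sample string (hypothetical data, not real Douban output) and also shows `re.search(...).groupdict()`, which keeps the names. It needs no network access.

```python
import re

# Hypothetical sample that mimics the shape of the data embedded in the page.
sample = '''{
  "name": "Example Movie",
  "url": "/subject/0000000/",
  "image": "https://example.com/poster.jpg"
}'''

regex = r"\"name\":\s\"(?P<name>.*?)\",\s*\"url\":\s\"(?P<url>.*?)\",\s*\"image\":\s\"(?P<image>.*?)\""

# findall: one tuple per match, ordered by group position.
tuples = re.findall(regex, sample)

# search + groupdict: a dict keyed by group name for the first match.
match = re.search(regex, sample)
fields = match.groupdict() if match else {}

print(tuples)
print(fields)
```

Reading `fields["name"]` is usually easier to follow than indexing into a tuple, at the cost of only seeing the first match.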

## For Douban, the embedded JSON-LD block can be extracted to get the relevant information

Related regular expression: https://regex101.com/r/Ac3L5n/1

In [None]:
import json

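# Capture everything between the JSON-LD <script> tag and its closing tag
regex = r"ld\+json\">(.*?)</script>"
jsonld = re.findall(regex, response.text, re.MULTILINE | re.DOTALL)[0]

# Parse the captured JSON text into a Python dict
movie = json.loads(jsonld)
movie['name']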

In [None]:
# Screenwriters
[x['name'] for x in movie['author']]

In [None]:
# Actors
[x['name'] for x in movie['actor']]

In [None]:
# Rating
float(movie['aggregateRating']['ratingValue'])
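
The JSON-LD steps above can be tried offline. The following is a minimal sketch that assumes a hand-written HTML snippet (`sample_html`, hypothetical data) and a helper named `extract_json_ld` (an illustrative name, not part of any library). It returns `None` when no JSON-LD block is found instead of raising `IndexError`, and passes `strict=False` so that raw control characters inside strings do not abort parsing.

```python
import json
import re

# Hypothetical page fragment with the same keys used in the cells above.
sample_html = '''<html><head>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "name": "Example Movie",
  "author": [{"@type": "Person", "name": "Writer One"}],
  "actor": [{"@type": "Person", "name": "Actor One"},
            {"@type": "Person", "name": "Actor Two"}],
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": "8.5"}
}
</script>
</head><body></body></html>'''


def extract_json_ld(html):
    """Return the first JSON-LD block in html as a dict, or None if absent."""
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return None
    # strict=False allows control characters (e.g. newlines) inside strings.
    return json.loads(match.group(1), strict=False)


movie_sample = extract_json_ld(sample_html)
actors = [x["name"] for x in movie_sample["actor"]]
rating = float(movie_sample["aggregateRating"]["ratingValue"])

print(movie_sample["name"], actors, rating)
```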

## Extracting with BeautifulSoup

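BeautifulSoup provides methods for parsing XML and HTML, and it supports CSS selectors, which make it faster to locate the elements you need.
A CSS selector can be copied directly from the browser.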

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
[x.text for x in soup.select("#info > span.actor > span.attrs > a")]

# Exercise
Modify the code below so that it reads the comments for this movie: https://movie.douban.com/subject/11026735/comments

In [None]:
url = "fill in the correct URL"
response = requests.get(url, headers=headers)
response.content[0:500]

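Tip: to get the comments, use the browser's "Copy selector" to obtain the CSS expression
`#comments > div:nth-child(1) > div.comment > p > span`
Note the `div:nth-child(1)` here: it matches only the first div element. To iterate over all of them, remove `:nth-child(1)`.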

![](./resources/douban-practice-tip-1.png)
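
The effect of `:nth-child(1)` can be checked without a network request. This sketch assumes a small hand-written HTML fragment (`sample_html`, hypothetical markup that only imitates the nesting described in the tip) and compares the two selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three comment items nested like the selector in the tip.
sample_html = """
<div id="comments">
  <div class="comment-item"><div class="comment"><p><span>Great film</span></p></div></div>
  <div class="comment-item"><div class="comment"><p><span>Too long</span></p></div></div>
  <div class="comment-item"><div class="comment"><p><span>Worth a rewatch</span></p></div></div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# With :nth-child(1), only the first child div of #comments is matched.
first_only = [x.text for x in sample_soup.select(
    "#comments > div:nth-child(1) > div.comment > p > span")]

# Without it, every direct child div of #comments is matched.
all_items = [x.text for x in sample_soup.select(
    "#comments > div > div.comment > p > span")]

print(first_only)
print(all_items)
```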


In [None]:
soup = BeautifulSoup(response.content, 'html.parser')
[x.text for x in soup.select("fill in the correct CSS selector")]

In [None]:
[x.text for x in soup.select("#comments > div > div.comment > h3 > span.comment-info > a")]

Find the link to the next page and iterate over all pages:

In [None]:
baseurl = "https://movie.douban.com/subject/11026735/comments"
nextpage_url = ""
while True:
    response = requests.get(baseurl + nextpage_url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    nextpage = soup.select("#paginator > a.next")
    content = [x.text for x in soup.select("#comments > div > div.comment > p > span")]
    print("All comments on this page:", content)

    # Stop when the page has no "next" link
    if len(nextpage) == 0:
        break

    # .get() returns None when href is missing, whereas ['href'] would raise KeyError
    nextpage_url = nextpage[0].get('href')
    if nextpage_url is None:
        break
    print("Fetching", nextpage_url)
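
The loop above can be exercised offline by separating the paging logic from the HTTP request. The sketch below is one possible arrangement, not the notebook's own method: `crawl_pages`, `fake_fetch`, `FAKE_PAGES`, `max_pages` and `delay_seconds` are all hypothetical names introduced for illustration. It resolves each next-page link with `urllib.parse.urljoin`, caps the number of pages, and can pause between requests.

```python
import time
from urllib.parse import urljoin


def crawl_pages(start_url, fetch_page, max_pages=5, delay_seconds=0.0):
    """Follow next-page links until none is left or max_pages is reached.

    fetch_page(url) must return (comments, next_href), where next_href is
    the raw href of the next-page link, or None on the last page.
    """
    url = start_url
    all_comments = []
    for _ in range(max_pages):
        comments, next_href = fetch_page(url)
        all_comments.extend(comments)
        if not next_href:
            break
        # A query-only href such as "?start=2" keeps the path of the current URL.
        url = urljoin(url, next_href)
        time.sleep(delay_seconds)  # pause between requests
    return all_comments


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_PAGES = {
    "https://example.com/comments": (["first", "second"], "?start=2"),
    "https://example.com/comments?start=2": (["third"], "?start=3"),
    "https://example.com/comments?start=3": (["fourth"], None),
}


def fake_fetch(url):
    return FAKE_PAGES[url]


result = crawl_pages("https://example.com/comments", fake_fetch)
print(result)
```

For real pages, `fetch_page` would wrap the `requests.get` and `soup.select` calls from the cell above, and a non-zero `delay_seconds` keeps the request rate modest.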