# Crawling Strategies for Static Web Pages


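* Understand the crawling strategy for static web pages
* Get to know a package suited to crawling static pages: Requests
* Get to know a package suited to crawling static pages: BeautifulSoup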

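## Assignment Goals

Use Requests + BeautifulSoup to fetch and parse the content of the following two sites:

1. Dcard: https://www.dcard.tw/f
2. Zhihu (知乎): https://www.zhihu.com/explore

Then answer the following questions:

1. After Requests returns a response, how do we get the data out, and what is its type?
2. Why process the result with BeautifulSoup, and what is the type after processing?
3. The data that comes back from Zhihu looks a little odd. How can we fix that?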

### 1. Dcard: https://www.dcard.tw/f

In [0]:
import requests
from bs4 import BeautifulSoup

In [5]:
url = 'https://www.dcard.tw/f'
# Send a browser-like User-Agent with the request
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

r = requests.get(url, headers=headers)

# Decode the response body as UTF-8 before reading r.text
r.encoding = 'utf-8'
print(r.text[0:3000])

<!DOCTYPE html><html lang="zh-Hant-TW"><head prefix="og: http://ogp.me/ns#" itemscope="" itemType="https://schema.org/WebSite"><title data-react-helmet="true">Dcard</title><meta data-react-helmet="true" property="og:image" content="https://www.dcard.tw/build/landing-c9e7b8fb.png"/><meta data-react-helmet="true" property="og:image:secure_url" content="https://www.dcard.tw/build/landing-c9e7b8fb.png"/><meta data-react-helmet="true" charSet="utf-8"/><meta data-react-helmet="true" http-equiv="X-UA-Compatible" content="IE=edge"/><meta data-react-helmet="true" name="application-name" content="Dcard"/><meta data-react-helmet="true" name="apple-itunes-app" content="app-id=951353454"/><meta data-react-helmet="true" name="theme-color" content="#006aa6"/><meta data-react-helmet="true" name="mobile-web-app-capable" content="yes"/><meta data-react-helmet="true" name="apple-mobile-web-app-capable" content="yes"/><meta data-react-helmet="true" property="fb:app_id" content="211628828926493"/><meta dat

In [6]:
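print('After Requests returns a response, how do we get the data out, and what is its type? => ', type(r.text))

After Requests returns a response, how do we get the data out, and what is its type? =>  <class 'str'>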
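
To make the answer above more concrete: `r.content` holds the raw response bytes, and `r.text` is those bytes decoded into a `str` using `r.encoding`. The sketch below imitates that decoding step with a hard-coded byte string (a hypothetical stand-in for a real response body) so that it runs without network access.

```python
# Stand-in for r.content: raw bytes a server might send (hypothetical sample).
raw_body = '<title>發現 - Dcard</title>'.encode('utf-8')

# Roughly what r.text does once r.encoding = 'utf-8' is set: bytes -> str.
decoded = raw_body.decode('utf-8')

# Decoding with the wrong codec is what produces garbled characters.
garbled = raw_body.decode('latin-1')

print(type(raw_body))   # <class 'bytes'>
print(type(decoded))    # <class 'str'>
print(decoded)
print(decoded == garbled)
```

This is why the notebook sets `r.encoding = 'utf-8'` before reading `r.text`: the string is only as good as the codec used to decode the bytes.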


In [7]:
soup = BeautifulSoup(r.text, 'html.parser')
print(soup)

2-01T15:44:18.702Z","updatedAt":"2019-12-02T07:39:05.621Z"},{"id":"1599354a-c141-4912-9e7f-9e1aca04ffba","url":"https:\u002F\u002Fi.imgur.com\u002FeuIl46P.gif","normalizedUrl":"https:\u002F\u002Fimgur.com\u002FeuIl46P","thumbnail":"https:\u002F\u002Fi.imgur.com\u002FeuIl46Pl.jpg","type":"image\u002Fimgur","tags":["ANNOTATED"],"createdAt":"2019-12-01T15:44:18.702Z","updatedAt":"2019-12-02T07:39:05.621Z"},{"id":"5b52957c-41d1-4dea-8a17-1ed1b6f38927","url":"https:\u002F\u002Fi.imgur.com\u002FPTBh0Rw.gif","normalizedUrl":"https:\u002F\u002Fimgur.com\u002FPTBh0Rw","thumbnail":"https:\u002F\u002Fi.imgur.com\u002FPTBh0Rwl.jpg","type":"image\u002Fimgur","tags":["ANNOTATED"],"createdAt":"2019-12-01T15:44:18.702Z","updatedAt":"2019-12-02T07:39:05.621Z"},{"id":"65bfd48a-a615-4d13-a5ed-d666c38c6029","url":"https:\u002F\u002Fi.imgur.com\u002Fa9nVC1G.gif","normalizedUrl":"https:\u002F\u002Fimgur.com\u002Fa9nVC1G","thumbnail":"https:\u002F\u002Fi.imgur.com\u002Fa9nVC1Gl.jpg","type":"image\u002Fimgur"

In [8]:
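print('Why process the result with BeautifulSoup? (It offers good compatibility, fast processing, and a choice of parsers for different formats.) What is the type after processing? => ', type(soup))

Why process the result with BeautifulSoup? (It offers good compatibility, fast processing, and a choice of parsers for different formats.) What is the type after processing? =>  <class 'bs4.BeautifulSoup'>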
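
A practical reason for parsing is that a `str` only supports text operations, while the parsed tree can be queried by tag and attribute. The following is a minimal sketch on a hard-coded HTML snippet; the markup and the `post-title` class name are made up for illustration and are not taken from Dcard.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for r.text; the class names are invented.
html = """
<html><head><title>Demo forum</title></head>
<body>
  <a class="post-title" href="/f/1">First post</a>
  <a class="post-title" href="/f/2">Second post</a>
  <a class="nav" href="/about">About</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Query the tree instead of searching the raw string.
page_title = soup.title.string
links = soup.find_all('a', class_='post-title')
posts = [(a.get_text(strip=True), a['href']) for a in links]

print(type(soup))
print(page_title)
print(posts)
```

On a real page, the tag names and classes have to be read from the page source in the browser's developer tools first.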


### 2. Zhihu (知乎): https://www.zhihu.com/explore

In [9]:
import requests
url = 'https://www.zhihu.com/explore'

# Adding a User-Agent header is enough to get the normal page back
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
r = requests.get(url, headers=headers)

r.encoding = 'utf-8'
print(r.text[0:600])

<!doctype html>
<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">发现 - 知乎</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/><meta name="force-rendering" content="webkit"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg"/><meta name="description" property="og:description" content="有问题，上知乎。知乎，可信赖的问答社区，以让每个人高效获得可信赖的解答为使命。知乎凭借认真、专业和友善的社区氛围，结构化、易获得的
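
To see why the header matters, it can help to compare the User-Agent that Requests sends by default with the browser-like one used above. The sketch below builds the request without sending it, so it needs no network access. How a given site reacts to the default value is not something this code can show, so treat that part as an assumption to verify yourself.

```python
import requests

# The User-Agent string requests uses when no headers are supplied.
default_ua = requests.utils.default_user_agent()

# Browser-like User-Agent copied from the cell above.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

# Prepare (but do not send) the request to inspect the outgoing headers.
# Session defaults are not merged here; only the headers we passed appear.
prepared = requests.Request(
    'GET', 'https://www.zhihu.com/explore', headers=headers
).prepare()

print(default_ua)
print(prepared.headers['User-Agent'])
```

The default value identifies the client as a script, which some sites may treat differently from a browser.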


### 3. The data that comes back from Zhihu looks a little odd. How can we fix that?

In [4]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/explore'

# Adding a User-Agent header is enough to get the normal page back
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

r = requests.get(url, headers=headers)
r.encoding = 'utf-8'

# Undo the \uXXXX escapes in the embedded data so URLs and tags are readable
text = (
    r.text
    .replace(r'\u002F', '/')
    .replace(r'\u003C', '<')
    .replace(r'\u003E', '>')
)
soup = BeautifulSoup(text, 'html.parser')

# print(soup.prettify())
# a_tags = soup.find_all('a', class_='ExploreRoundtableCard-title')
# for tag in a_tags:
#     print(tag.string)
print(soup)

镜片，有哪些注意事项？","url":"https://www.zhihu.com/question/19757269","type":"question","id":19757269,"answerCount":4},{"followerCount":1482,"title":"眼镜镜片应该怎么清洗？","url":"https://www.zhihu.com/question/34429972","type":"question","id":34429972,"answerCount":98},{"followerCount":696,"title":"中国学生近视率极高的问题，真的无解吗？","url":"https://www.zhihu.com/question/41124406","type":"question","id":41124406,"answerCount":66}],"logo":"https://pic4.zhimg.com/50/v2-0fe5aeeefff652ded28014f8d1742a9b_hd.jpg","urlToken":"peijing"},"dianshang":{"startAt":1575252000,"name":"2020 电商增长驱动力","title":"2020 电商增长驱动力","color":"#0000CD","banner":"https://pic4.zhimg.com/50/v2-b8397aa6267cb9ff46008e120dd09370_hd.jpg","tinyBanner":"https://pic3.zhimg.com/50/v2-43559e2c82a8ece54c81fc809bf82748_hd.jpg","endAt":1576166340,"url":"/roundtables/explore/cards/roundtable/dianshang","followersCount":1228,"intro":"过去两年间，物流环境提升、下沉市场购买力崛起，电商行业增速反弹，涌现了许多技术和商业模式的创新，共同激发着供给侧和消费侧的潜能，基于这场圆桌我们从社交、直播、产业、智能、跨境等方面来聊一聊电商行业的增长驱动力。","isFollowing":false,"gue
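
The output above looks like a JSON state object embedded in the page rather than ordinary HTML markup. Instead of chaining `.replace()` calls, one alternative is to extract that JSON text and pass it to `json.loads`, which decodes `\u002F`-style escapes on its own. The sketch below works on a hard-coded sample: the `<script id="js-initialData">` tag, the key layout, and the example.com URL are all assumptions for illustration and should be checked against the actual page source.

```python
import json
import re

# Hypothetical page fragment; the script id and JSON layout are assumptions.
sample_html = (
    '<html><body><div id="root"></div>'
    '<script id="js-initialData" type="text/json">'
    r'{"roundtables":[{"title":"2020 电商增长驱动力",'
    r'"url":"https:\u002F\u002Fexample.com\u002Froundtable\u002Fdianshang"}]}'
    '</script></body></html>'
)

# Grab the text between the opening and closing script tags.
match = re.search(
    r'<script id="js-initialData"[^>]*>(.*?)</script>',
    sample_html,
    flags=re.DOTALL,
)
if match is None:
    raise ValueError('embedded JSON block not found')

# json.loads turns \u002F into "/" and returns plain dicts and lists.
data = json.loads(match.group(1))

for card in data['roundtables']:
    print(card['title'], card['url'])
```

Working with the decoded dictionary avoids guessing which escapes need replacing and gives direct access to fields such as `title` and `url`.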