# Scraping Strategy for Static Web Pages


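* Understand the scraping strategy for static web pages
* Get to know a package suited to scraping static pages: Requests
* Get to know a package suited to scraping static pages: BeautifulSoup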

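## Assignment Goals

Use Requests + BeautifulSoup to fetch and parse the content of the following two sites:

1. Dcard: https://www.dcard.tw/f
2. Zhihu: https://www.zhihu.com/explore

Then answer the following questions:

1. After Requests returns a response, how do you get the data out, and what is its data type?
2. Why process the result with BeautifulSoup? What is the type after processing?
3. The data returned from Zhihu looks a bit odd. How can that be fixed?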

### 1. Dcard: https://www.dcard.tw/f

In [8]:
import requests
from bs4 import BeautifulSoup

In [9]:
url = 'https://www.dcard.tw/f'
r = requests.get(url)    # send an HTTP GET request
r.encoding = 'utf-8'     # decode the response body as UTF-8
print(r.text[0:3000])    # preview the first 3000 characters

<!DOCTYPE html><html lang="zh-Hant-TW"><head prefix="og: http://ogp.me/ns#" itemscope="" itemType="https://schema.org/WebSite"><title data-react-helmet="true">Dcard</title><meta data-react-helmet="true" property="og:image" content="https://www.dcard.tw/build/landing-c9e7b8fb.png"/><meta data-react-helmet="true" property="og:image:secure_url" content="https://www.dcard.tw/build/landing-c9e7b8fb.png"/><meta data-react-helmet="true" charSet="utf-8"/><meta data-react-helmet="true" http-equiv="X-UA-Compatible" content="IE=edge"/><meta data-react-helmet="true" name="application-name" content="Dcard"/><meta data-react-helmet="true" name="apple-itunes-app" content="app-id=951353454"/><meta data-react-helmet="true" name="theme-color" content="#006aa6"/><meta data-react-helmet="true" name="mobile-web-app-capable" content="yes"/><meta data-react-helmet="true" name="apple-mobile-web-app-capable" content="yes"/><meta data-react-helmet="true" property="fb:app_id" content="211628828926493"/><meta dat

In [11]:
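print('After Requests returns a response, how do you get the data out, and what is its data type? =>', type(r.text))

After Requests returns a response, how do you get the data out, and what is its data type? => <class 'str'>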
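
To make the answer a little more concrete: `r.text` is the body decoded to a `str` using `r.encoding`, while `r.content` holds the raw `bytes`. The sketch below uses two hypothetical helpers, `fetch` and `describe_response`, which are not part of the original notebook; they are only one possible way to inspect a response.

```python
import requests


def fetch(url, timeout=10):
    """Fetch a page and decode it as UTF-8 (hypothetical helper)."""
    r = requests.get(url, timeout=timeout)
    r.encoding = 'utf-8'
    return r


def describe_response(resp):
    """Report the types of the common ways to read a response body.

    Hypothetical helper for illustration; works with any object that
    exposes status_code, text, content and encoding.
    """
    return {
        "status_code": resp.status_code,               # HTTP status as an int
        "text_type": type(resp.text).__name__,         # decoded body
        "content_type": type(resp.content).__name__,   # raw body
        "encoding": resp.encoding,                     # codec used for .text
    }


# Usage (needs network access):
# print(describe_response(fetch('https://www.dcard.tw/f')))
```

For HTML parsing `r.text` is usually the convenient choice; `r.content` is useful when the bytes should be passed on untouched, for example when saving an image.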


In [13]:
soup = BeautifulSoup(r.text, "html5lib")
print('Why process the result with BeautifulSoup? What is the type after processing? =>', type(soup))

Why process the result with BeautifulSoup? What is the type after processing? => <class 'bs4.BeautifulSoup'>
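
The reason for the extra step is that a plain `str` can only be searched as text, whereas a `BeautifulSoup` object is a parsed tree that can be queried by tag and attribute. The following minimal sketch parses a short fragment taken from the Dcard output above. It assumes the built-in `"html.parser"` backend rather than the `"html5lib"` parser used in the notebook, so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# A short fragment taken from the Dcard response shown earlier
html = (
    '<html><head>'
    '<title data-react-helmet="true">Dcard</title>'
    '<meta data-react-helmet="true" property="og:image" '
    'content="https://www.dcard.tw/build/landing-c9e7b8fb.png"/>'
    '</head><body></body></html>'
)

# Assumption: "html.parser" ships with Python; the notebook uses "html5lib"
soup = BeautifulSoup(html, "html.parser")

# Navigate the tree instead of searching the raw string
title = soup.title.string
og_image = soup.find("meta", attrs={"property": "og:image"})["content"]

print(title)
print(og_image)
```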


### 2. Zhihu: https://www.zhihu.com/explore

In [14]:
url = 'https://www.zhihu.com/explore'
r = requests.get(url)
r.encoding = 'utf-8'

print(r.text[0:600])

<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
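
The body above is an error page, which is only visible here because the text was printed. Checking the status code is a more direct way to notice a failed request. Below is a minimal sketch with a hypothetical helper, `explain_status`, that is not part of the original notebook; the `Response` object is built by hand so the example runs without network access.

```python
import requests


def explain_status(resp):
    """Return a short message saying whether the request succeeded."""
    try:
        resp.raise_for_status()   # raises requests.HTTPError for 4xx/5xx
    except requests.HTTPError as exc:
        return f"request failed: {exc}"
    return f"request succeeded with status {resp.status_code}"


# Offline demonstration with a hand-built Response object
fake = requests.Response()
fake.status_code = 400
print(explain_status(fake))
```

With a real request, the same check would be applied to the object returned by `requests.get(url)` before the body is handed to a parser.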



### 3. The data returned from Zhihu looks a bit odd. How can that be fixed?

In [1]:
import requests
url = 'https://www.zhihu.com/explore'

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7',
    'cache-control': 'max-age=0',
    'cookie': '_xsrf=3fZOonG218JMoiZNDql4JewBp8f0Lnfn; _zap=e3e408ee-c0f5-4e17-8b5a-52e9bd23b264; d_c0="ADDiOrxOghCPTnksGwd82EtwKj7GcCQ7euU=|1576402699"; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1576402712,1576422763; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1576422763; l_n_c=1; q_c1=b5ff17403ce5464aa6b6e5ed806d1b81|1576422824000|1576422824000; n_c=1; __utma=51854390.259669286.1576422839.1576422839.1576422839.1; __utmb=51854390.0.10.1576422839; __utmc=51854390; __utmz=51854390.1576422839.1.1.utmcsr=localhost:8889|utmccn=(referral)|utmcmd=referral|utmcct=/notebooks/Documents/GitHub/1st-PyCrawlerMarathon/Data/Day008_Sample.ipynb; __utmv=51854390.010--|3=entry_date=20191215=1; l_cap_id="ZDJjOThkNTMwM2YzNDllY2I1ZjBjODViYzAyMDdmYjM=|1576423531|b793983eac83ead0b5a8c579c6be159dcf4b9e34"; r_cap_id="NmU5MGEzYjJlYTY2NDJkOTgzNDJiMTc0OTRlZmQ1YzE=|1576423531|17aa067b76906b1736736245a892d0e87a6a18b3"; cap_id="YTZmOTNhNmQzMDI3NGQ3OWE1OWM5NzhmNWU3OGIwZmY=|1576423531|e13f2144ca9d3bd7eb4df598049a4c7ea02dbb94"; tgw_l7_route=3b88c43780310e1e8b8457df684cff66',
    'referer': 'http://localhost:8889/notebooks/Documents/GitHub/1st-PyCrawlerMarathon/Data/Day008_Sample.ipynb',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
# Send the request again, this time with browser-like headers
r = requests.get(url, headers=headers)
r.encoding = 'utf-8'

print(r.text[0:600])

<!doctype html>
<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">发现 - 知乎</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/><meta name="force-rendering" content="webkit"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg"/><meta name="description" property="og:description" content="有问题，上知乎。知乎，可信赖的问答社区，以让每个人高效获得可信赖的解答为使命。知乎凭借认真、专业和友善的社区氛围，结构化、易获得的
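
The header set above was copied from a browser session, so its `cookie` and `referer` values are tied to that session. Often a browser-like `User-Agent` is the part that matters, but whether it alone is enough for Zhihu is an assumption that the notebook does not verify. The sketch below uses a hypothetical helper, `build_session`, to attach a smaller header set to a `requests.Session`, which also keeps any cookies the server sets across later requests.

```python
import requests

# Same User-Agent string as in the headers above
BROWSER_UA = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/75.0.3770.100 Safari/537.36'
)


def build_session(user_agent=BROWSER_UA):
    """Create a Session that sends browser-like headers (hypothetical helper)."""
    session = requests.Session()
    session.headers.update({
        'User-Agent': user_agent,
        'Accept-Language': 'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7',
    })
    return session


session = build_session()

# Usage (needs network access; the result is not guaranteed):
# r = session.get('https://www.zhihu.com/explore', timeout=10)
# print(r.status_code, r.text[0:600])
```

If the smaller header set is still rejected, headers from the full set can be added back one at a time to find out which ones the server requires.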
