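# Data-Scraping Strategies for Static Web Pages


* Understand data-scraping strategies for static web pages
* Get to know a package suited to scraping static pages: Requests
* Get to know a package suited to scraping static pages: BeautifulSoup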

## Fetching Page Data: Requests

In [45]:
# Import the library
import requests
# Send a GET request to the target URL we want to scrape
r = requests.get('https://github.com/timeline.json')
# r.text holds the response body as a str
response = r.text

In [4]:
print(response)
print(type(response))
response["message"]  # raises TypeError: response is still a str, not a dict

{"message":"Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.","documentation_url":"https://developer.github.com/v3/activity/events/#list-public-events"}
<class 'str'>


TypeError: string indices must be integers

In [5]:
print(type(response))

# print(response['message'])
print(response[:100])

<class 'str'>
{"message":"Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our


In [6]:
import json
response = json.loads(response)

print(type(response))
print(response['message'])

<class 'dict'>
Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.
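
The str-to-dict conversion above can be wrapped in a small helper that also copes with bodies that are not JSON at all. This is a minimal sketch, not part of the original notebook: `parse_json_body` is a hypothetical name, and the shortened `sample` string stands in for `r.text`. On a real response, Requests also provides `r.json()`, which performs the same decoding.

```python
import json


def parse_json_body(text):
    """Return the decoded JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Shortened sample body standing in for r.text (illustrative only)
sample = (
    '{"message": "Hello there, wayfaring stranger.", '
    '"documentation_url": "https://developer.github.com/v3/activity/events/#list-public-events"}'
)

data = parse_json_body(sample)
print(type(data))
print(data["message"])

# An HTML body is not JSON, so the helper returns None instead of raising
print(parse_json_body("<html>not json</html>"))
```

Returning `None` is only one possible design; re-raising the error may be preferable when a JSON body is strictly expected.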


## HTML Parser: BeautifulSoup


In [4]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="title-id"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

print(html_doc)
print(type(html_doc))


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="title-id"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>

<class 'str'>


In [5]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html5lib")
print(soup)
print(type(soup))

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="title-id"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>


</body></html>
<class 'bs4.BeautifulSoup'>


In [41]:
#soup.p['id']
soup.find_all('a', class_='sister')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
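
`find_all` returns tag objects, so the usual next step is to pull the text and attributes out of each one. The sketch below assumes a shortened copy of the sample HTML and uses Python's built-in `"html.parser"` instead of `"html5lib"` so that it runs without an extra dependency; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (only the links are kept)
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Collect (link text, href) pairs from every <a class="sister"> tag
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", class_="sister")]
for name, href in links:
    print(name, href)

# The same kind of lookup written as a CSS selector
first = soup.select_one("a#link1")
print(first["id"], first.get("class"))
```

Note that `class` is a multi-valued attribute, so `first.get("class")` yields a list rather than a plain string.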

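## Assignment Goals

Use Requests + BeautifulSoup to scrape and parse the content of the following two sites:

1. Dcard: https://www.dcard.tw/f
2. Zhihu: https://www.zhihu.com/explore

Then answer the following questions:

1. After Requests returns a response, how do you extract the data, and what is its data type?
2. Why is BeautifulSoup needed to process it?
3. The data returned by Zhihu looks a little odd. How can this be resolved?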

In [61]:
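r = requests.get('https://www.dcard.tw/f')

# r.text is an HTML string, so it is parsed with BeautifulSoup rather than json.loads
response = r.text
soup = BeautifulSoup(response, "html5lib")  # "lxml" is an alternative parser
print(type(soup))
# print(soup.prettify())  # uncomment to inspect the full markup

# Each post entry is an <a> tag; these class names look auto-generated and may change
posts = soup.find_all('a', class_='PostEntry_root_V6g0rd')
print(type(posts))

for post in posts:
    title = post.find('h3', class_='Title__Text-v196i6-0 gmfDU')
    if title:
        print(title.text)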

<class 'bs4.BeautifulSoup'>
<class 'bs4.element.ResultSet'>
你讓我活在電影裡
日本IKEA鯊魚
更 體會到有慾望卻無法滅掉的感覺(微西斯
一早醒來看到壞消息
偷偷進化的嘴笨男友
2020來了，別再回頭2019
【16歲那年，我得到了IELTS 7.0】(英文進步方法)
我在火車上被攻擊
（更）男友說我是普妹
香蕉貼膠帶 藝術，收藏家出365萬
什麼人吸引什麼人嗎？我的歷任男友共同點
男友耳後粉刺（內有超療癒洞洞照）
柴犬
看完MAMA/MMA的想法
是EXO變了 還是只是我不了解他們
你們很棒 表演很精彩
所謂的約砲其實很簡單
終於告白了！
又再次被黑五燒到燙手的我😭
噁男才不知道自己是噁男
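
Class names such as `Title__Text-v196i6-0 gmfDU` look machine-generated, so an exact match may stop working when the site is rebuilt. One possible workaround, sketched here under that assumption, is to pass a compiled regular expression as `class_` so that only the stable-looking prefix is matched. The HTML below is hypothetical markup that imitates the pattern; it is not real Dcard output.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup imitating generated class names (not real Dcard HTML)
sample_html = """
<a class="PostEntry_root_V6g0rd"><h3 class="Title__Text-v196i6-0 gmfDU">First post</h3></a>
<a class="PostEntry_root_V6g0rd"><h3 class="Title__Text-x9z0a1-0 kQwErT">Second post</h3></a>
<a class="PostEntry_root_V6g0rd"><span>No title here</span></a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Match any <h3> that has a class starting with "Title__Text"
title_class = re.compile(r"^Title__Text")
titles = [h3.get_text(strip=True) for h3 in soup.find_all("h3", class_=title_class)]
print(titles)
```

Whether the prefix itself stays stable across site updates is also an assumption, so the selector should be re-checked against the live page.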


In [44]:
# Send an explicit User-Agent header with the request
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get('https://www.zhihu.com/explore', headers=headers)

response = r.text
soup = BeautifulSoup(response, "html5lib")
# print(soup.prettify())  # uncomment to inspect the full markup
titles = soup.find_all('a', class_='ExploreSpecialCard-title')
for t in titles:
    print(t.text)

中超大结局，广州恒大豪夺第八冠
向时间开杠，抗老当然有正确答案
向上生活，渐入「家」境
当「猝死」离我们越来越近
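
The Zhihu request above differs from the Dcard one only in its `user-agent` header. A plausible reading, stated here as an assumption, is that the site responds differently to the default Requests user agent. The sketch below wraps that pattern in a hypothetical helper, `fetch_html`, and adds a status check; `FakeResponse` and `fake_get` are stand-ins so the example runs without network access.

```python
def fetch_html(url, user_agent="my-app/0.0.1", timeout=10, get=None):
    """Fetch a page with an explicit User-Agent header and return its text."""
    if get is None:
        import requests  # imported lazily so the helper can be exercised offline
        get = requests.get
    response = get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


class FakeResponse:
    """Stand-in for requests.Response, used only for this offline demo."""
    text = "<html><body>ok</body></html>"

    def raise_for_status(self):
        return None


seen = {}  # records what the helper passed to the HTTP function


def fake_get(url, headers=None, timeout=None):
    seen.update(url=url, headers=headers, timeout=timeout)
    return FakeResponse()


html = fetch_html("https://www.zhihu.com/explore", get=fake_get)
print(seen["headers"])
print(html)
```

In real use, calling `fetch_html(url)` without the `get` argument falls back to `requests.get`, and the returned string can be passed to `BeautifulSoup` as in the cells above.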
