Practicing web scraping with the examples from Chapter 3 of《Python 網路爬蟲與資料分析入門實戰》

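My own code, written as practice and based on the Chapter 3 example that scrapes today's popular news from the Liberty Times (自由時報).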

Original source code from the book: https://github.com/jwlin/web-crawler-tutorial/tree/master/ch3

# Today's popular news: (1) scrape the data (2) save the data

### 1. Scrape the data

In [1]:
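import requests
from bs4 import BeautifulSoup

# 1. Request the page
resp = requests.get('https://news.ltn.com.tw/list/breakingnews/popular')
# 2. Parse the HTML source
soup = BeautifulSoup(resp.text, 'html5lib')
# 3. Locate the general area that holds the data: every <li> inside <ul class="list">
liberty = soup.find('ul', 'list').find_all('li')
# 4. Create an empty list to hold the results
news = []

# 5. Loop over the items and extract the data
for a in liberty:
    # 5.1 Create an empty dict for this item
    new = dict()
    # 5.2 Title: text of the <span> whose class is "title"
    new['title'] = a.find('span', 'title').text.strip()
    # 5.3 Time: text of the <span> whose class is "time"
    new['time'] = a.find('span', 'time').text.strip()
    # 5.4 URL: href attribute of the <a> whose class is "tit"
    new['href'] = a.find('a', 'tit')['href']
    # 5.5 Append this item to the list
    news.append(new)
# 5.6 Print the result (after the loop has finished)
print(news)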

[{'title': '驚！韓國瑜新北造勢時隔3個月 韓粉人數少了75％', 'time': '2019-12-08 19:28', 'href': 'https://news.ltn.com.tw/news/politics/breakingnews/3002676'}, {'title': 'CBA》可悲！全場向吉喆默哀 中國球迷素質讓林書豪動怒了', 'time': '12:10', 'href': 'https://sports.ltn.com.tw/news/breakingnews/3003193'}, {'title': '日學者警告：慎防中共暗殺台灣總統候選人', 'time': '06:18', 'href': 'https://news.ltn.com.tw/news/politics/breakingnews/3002925'}, {'title': '點名柯P一手催生韓流！陸之駿：2致命錯誤註定滅亡', 'time': '01:02', 'href': 'https://news.ltn.com.tw/news/politics/breakingnews/3002938'}, {'title': 'MLB》「歷史級鉅約」來了！ 洋基已開史上最狂合約給柯爾', 'time': '07:33', 'href': 'https://sports.ltn.com.tw/news/breakingnews/3002965'}, {'title': '200餘香港示威者逃往台灣！ 紐時：台北長老教會牧師居中聯繫', 'time': '2019-12-08 17:15', 'href': 'https://news.ltn.com.tw/news/world/breakingnews/3002543'}, {'title': '貨車司機玩命國道逼車、驟停致連環撞 公共危險判刑', 'time': '2019-12-08 22:24', 'href': 'https://news.ltn.com.tw/news/society/breakingnews/3002703'}, {'title': '超諷刺！陳玉珍粉專疑關閉 臉書顯示「右手」包紮OK繃', 'time': '2019-12-08 18:57', 'href': 'https://news.ltn
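
The calls above pass a bare string as the second argument to `find`, and Beautiful Soup matches that string against the CSS class rather than against an attribute named `title` or `tit`. The following is a minimal sketch of that behavior. The HTML snippet is made up to mimic the shape of one list item (it is not copied from the live site), and it uses the built-in `html.parser` instead of `html5lib` so that it runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the structure of one news item.
html = """
<ul class="list">
  <li>
    <a class="tit" href="https://example.com/news/1">
      <span class="time">12:10</span>
      <span class="title"> Sample headline </span>
    </a>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
item = soup.find('ul', 'list').find('li')

# Three ways to say "the <span> whose class is title".
by_position = item.find('span', 'title')         # style used in this notebook
by_keyword = item.find('span', class_='title')   # explicit keyword form
by_selector = item.select_one('span.title')      # CSS selector form

title = by_position.text.strip()
href = item.find('a', 'tit')['href']
print(title, href)
```

All three lookups return the same element, so the choice is mostly about readability; the keyword form makes it obvious that a class is being matched.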

### 2. Process and save the data

In [2]:
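import pandas as pd
import numpy as np

# 6. Process and save the data
# 6.1 Convert the list of dicts to a DataFrame
test = pd.DataFrame(data=news)
# 6.2 Make the index start at 1
test.index = np.arange(1, len(test) + 1)
# 6.3 Name the index "news_NO."
test.index.names = ['news_NO.']
# 6.4 Print the result
print(test)
# 6.5 Save as a CSV file (utf_8_sig writes a BOM, which helps Excel detect UTF-8)
test.to_csv('news.csv', encoding='utf_8_sig')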

                                                       href              time  \
news_NO.                                                                        
1         https://news.ltn.com.tw/news/politics/breaking...  2019-12-08 19:28   
2         https://sports.ltn.com.tw/news/breakingnews/30...             12:10   
3         https://news.ltn.com.tw/news/politics/breaking...             06:18   
4         https://news.ltn.com.tw/news/politics/breaking...             01:02   
5         https://sports.ltn.com.tw/news/breakingnews/30...             07:33   
6         https://news.ltn.com.tw/news/world/breakingnew...  2019-12-08 17:15   
7         https://news.ltn.com.tw/news/society/breakingn...  2019-12-08 22:24   
8         https://news.ltn.com.tw/news/politics/breaking...  2019-12-08 18:57   
9         https://sports.ltn.com.tw/news/breakingnews/30...  2019-12-08 22:02   
10        https://sports.ltn.com.tw/news/breakingnews/30...  2019-12-08 23:03   
11        https://news.ltn.c
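
To see what the index and encoding steps do without running the scraper, the sketch below repeats them on two made-up rows and writes the CSV to a temporary directory instead of `news.csv`. The sample rows and the temporary path are assumptions for illustration only.

```python
import codecs
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up rows standing in for the scraped `news` list.
news = [
    {'title': 'Sample A', 'time': '12:10', 'href': 'https://example.com/a'},
    {'title': 'Sample B', 'time': '06:18', 'href': 'https://example.com/b'},
]

test = pd.DataFrame(data=news)
test.index = np.arange(1, len(test) + 1)   # 1-based row numbers
test.index.names = ['news_NO.']            # label shown above the index column

# Write to a throwaway directory, then read the raw bytes back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'news.csv')
    test.to_csv(path, encoding='utf_8_sig')
    with open(path, 'rb') as f:
        raw = f.read()

print(list(test.index))
print(raw.startswith(codecs.BOM_UTF8))     # True when the file begins with a UTF-8 BOM
```

The column order in the printed DataFrame can differ between pandas versions, which is likely why the output above lists `href` before `time`.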

# Code without comments

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

resp = requests.get('https://news.ltn.com.tw/list/breakingnews/popular')
soup = BeautifulSoup(resp.text, 'html5lib')
liberty = soup.find('ul', 'list').find_all('li')
news=[]

for a in liberty:
    new=dict()
    new['title'] =a.find('span','title').text.strip()
    new['time'] =a.find('span','time').text.strip()
    new['href'] =a.find('a','tit')['href']
    news.append(new)

print(news)


test=pd.DataFrame(data=news)
test.index = np.arange(1,len(test)+1)
test.index.names = ['news_NO.']
print(test)
test.to_csv('news.csv',encoding='utf_8_sig')
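
The loop above assumes every `<li>` contains a title, a time, and a link. `find` returns `None` when nothing matches, so a single list item with a different shape (an advertisement slot, for instance) would raise an `AttributeError` or `TypeError`. The following is one possible defensive variant, not the book's method: `parse_items` is a hypothetical helper name, the sample HTML is made up, and `html.parser` is used so the sketch runs without `html5lib`.

```python
from bs4 import BeautifulSoup


def parse_items(html):
    """Return one dict per <li> that has a title, a time, and a link."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('ul', 'list')
    if container is None:
        # The page layout changed or the request returned something unexpected.
        return []

    news = []
    for li in container.find_all('li'):
        title = li.find('span', 'title')
        time = li.find('span', 'time')
        link = li.find('a', 'tit')
        # Skip items that are missing any of the three pieces.
        if title is None or time is None or link is None or not link.has_attr('href'):
            continue
        news.append({
            'title': title.text.strip(),
            'time': time.text.strip(),
            'href': link['href'],
        })
    return news


# Hypothetical markup: one complete item and one item without the expected tags.
sample = """
<ul class="list">
  <li><a class="tit" href="https://example.com/news/1">
    <span class="time">12:10</span><span class="title">Sample headline</span></a></li>
  <li><div class="ad">sponsored</div></li>
</ul>
"""

print(parse_items(sample))
```

Separating parsing from the network request also makes the parsing logic easy to check against saved HTML.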

# Crawler using the urllib package

In [None]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np  

url='https://news.ltn.com.tw/list/breakingnews/popular'

with urllib.request.urlopen(url) as response:
    resp = response.read()
soup = BeautifulSoup(resp, 'html5lib')
liberty = soup.find('ul', 'list').find_all('li')
news=[]

for a in liberty:
    new=dict()
    new['title'] =a.find('span','title').text.strip()
    new['time'] =a.find('span','time').text.strip()
    new['href'] =a.find('a','tit')['href']
    news.append(new)

print(news)


test=pd.DataFrame(data=news)
test.index = np.arange(1,len(test)+1)
test.index.names = ['news_NO.']
print(test)
test.to_csv('news.csv',encoding='utf_8_sig')
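
The main difference from the `requests` version is the type handed to Beautiful Soup: `resp.text` is an already decoded `str`, while `urlopen(...).read()` returns raw `bytes`. Beautiful Soup accepts both and works out the encoding of bytes itself. The sketch below illustrates this with a made-up page held in memory, so no network access is needed; the markup and variable names are assumptions, and `html.parser` stands in for `html5lib`.

```python
from bs4 import BeautifulSoup

# Hypothetical page body; the text is non-ASCII so a decoding mistake would show.
page_text = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><span class="title">今日熱門</span></body></html>'
)
# What urlopen(...).read() would hand back: the same page as UTF-8 bytes.
page_bytes = page_text.encode('utf-8')

soup_from_bytes = BeautifulSoup(page_bytes, 'html.parser')
soup_from_text = BeautifulSoup(page_text, 'html.parser')

title_from_bytes = soup_from_bytes.find('span', 'title').text
title_from_text = soup_from_text.find('span', 'title').text

print(title_from_bytes == title_from_text)
```

If a page declares no encoding and detection guesses wrong, decoding the bytes explicitly before parsing is one way to take control of the result.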