
# HTTP status codes


``` Python
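2xx: success
3xx: redirect. The site has moved; even if you enter the old URL, you are automatically redirected to the new one
4xx: client error
404 Not Found: the most common one. Usually the URL was mistyped.
403 Forbidden:
(1) Your IP has been banned: probably after repeated abusive behavior (too many requests in a short time) -> fix: wait, or switch IP
(2) Your traffic is suspected of coming from a program -> fix: behave more like a browser (send fuller headers)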
```
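
To make the categories above concrete, here is a minimal sketch that maps a status code to the same rough groups. The names `describe_status` and `fetch_status` are hypothetical helpers, not part of `urllib`. Note that `urlopen` follows redirects by itself, so a 3xx code is rarely seen directly, and it raises `HTTPError` for 4xx/5xx responses.

```python
import urllib.error
import urllib.request as req


def describe_status(code):
    """Map an HTTP status code to the rough category used in the notes."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if code == 403:
        return "forbidden: wait, switch IP, or send fuller headers"
    if code == 404:
        return "not found: check the URL"
    if 400 <= code < 500:
        return "client error"
    return "other"


def fetch_status(url):
    """Return (code, description) for a URL. This performs a real request."""
    try:
        with req.urlopen(url, timeout=10) as response:
            code = response.status
    except urllib.error.HTTPError as e:
        # urlopen raises HTTPError for 4xx/5xx; the status code is in e.code
        code = e.code
    return code, describe_status(code)


print(describe_status(200))
print(describe_status(404))
```

`fetch_status` is defined but not called here, so running the block does not touch the network.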



# Header


```Python
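When you send a URL, you are not sending only the URL
What you send is: URL + extra information (request headers)

When you get a response, you are not getting only the response body
What you receive is: response + extra information (response headers)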
```
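
A small sketch of what "URL + extra information" looks like with `urllib`. The header values below are made up for illustration. Building a `Request` does not send anything yet, so the block runs without network access.

```python
import urllib.request as req

# Illustrative values only; any browser-like User-Agent string would do
url = "https://www.ptt.cc/bbs/Beauty/index.html"
user_agent = "Mozilla/5.0 (X11; Linux x86_64)"

# Headers can be passed at construction time or added afterwards
r = req.Request(url, headers={"User-Agent": user_agent})
r.add_header("Accept-Language", "zh-TW")

# urllib stores header names capitalized, e.g. "User-agent"
sent_headers = dict(r.header_items())
print(sent_headers)

# After response = req.urlopen(r), the response headers are available as
# response.headers, for example response.headers["Content-Type"].
```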



# File handling


``` Python
1. Plain text
open(...., "r" or "w", encoding = "utf-8")
2. Non-text, binary files (images, pdf, doc...)
open(...., "rb" or "wb")
```
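
The two modes can be compared side by side. This is a minimal sketch that writes into a temporary folder; the file names `note.txt` and `blob.bin` are arbitrary. A `with` block closes the file automatically, even if an error occurs.

```python
import os
import tempfile

# A temporary folder keeps the example away from real files
workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "note.txt")
binary_path = os.path.join(workdir, "blob.bin")

# Text: mode "w"/"r" plus an explicit encoding; you write and read str
with open(text_path, "w", encoding="utf-8") as f:
    f.write("標題 title")
with open(text_path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary: mode "wb"/"rb" and no encoding; you write and read bytes
with open(binary_path, "wb") as f:
    f.write(b"\x89PNG\r\n")
with open(binary_path, "rb") as f:
    raw = f.read()

print(type(text).__name__, type(raw).__name__)  # str bytes
```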



In [1]:
# Exercise 1: download the images from a post on PTT's Beauty board
import urllib.request as req
import bs4 as bs

url = "https://www.ptt.cc/bbs/Beauty/M.1736755829.A.02A.html"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
r = req.Request(url)
# Add header information first so the request looks like it comes from a browser
r.add_header("User-Agent", user_agent)
response = req.urlopen(r)
# Parse the page source with bs4 (naming the parser avoids a bs4 warning)
html = bs.BeautifulSoup(response, "html.parser")

# Keep only links whose URL ends with an image file extension
allow_subname = ["jpg", "jpeg", "png", "gif"]
# Find every link in the page
links = html.find_all("a")
for l in links:
  # .get returns None instead of raising KeyError when an <a> has no href
  href = l.get("href")
  if href is None:
    continue
  # Split the URL string to get the file extension
  subname = href.split(".")[-1]
  if subname.lower() in allow_subname:
    # Send a request to fetch the image data
    r = req.Request(href)
    r.add_header("User-Agent", user_agent)
    img = req.urlopen(r)
    # Open a new file to save the image; "wb" because image data is binary
    fname = href.split("/")[-1]
    with open(fname, "wb") as f:
      f.write(img.read())

In [None]:
# Exercise 2-0: setting up a file path
import os

fpath = "test1/test2"
fname = fpath+"/"+"a.txt"
# The folder must exist before a file can be saved into it
if not os.path.exists(fpath):
  os.makedirs(fpath)

with open(fname, "w", encoding="utf-8") as f:
  f.write("abcdefg")
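
As a side note, the "check, then create" step can be folded into one call: `os.makedirs` accepts `exist_ok=True`, which makes it a no-op when the folder is already there. The sketch below works inside a temporary folder so it leaves no `test1` folder behind.

```python
import os
import tempfile

base = tempfile.mkdtemp()
fpath = os.path.join(base, "test1", "test2")

# Creates test1 and test1/test2 in one go
os.makedirs(fpath, exist_ok=True)
# Calling it again is harmless because of exist_ok=True
os.makedirs(fpath, exist_ok=True)

fname = os.path.join(fpath, "a.txt")
with open(fname, "w", encoding="utf-8") as f:
    f.write("abcdefg")

print(os.path.exists(fname))  # True
```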

In [2]:
# Exercise 2: print the author / title / time / board / comments, and save them as JSON
import os
import json
import urllib.request as req
import bs4 as bs

url = "https://www.ptt.cc/bbs/Beauty/M.1736733511.A.738.html"
# Prepare the file name and folder name (the file path) for saving
url_split = url.split("/")
# JSON file name
fname = url_split[-1] + ".json"
# Folder name (Beauty)
dirname = url_split[-2]
if not os.path.exists(dirname):
  os.makedirs(dirname)
# File path (folder name + file name)
fpath = dirname + "/" + fname

r = req.Request(url)
# Add header information first so the request looks like it comes from a browser
r.add_header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36")
response = req.urlopen(r)
# Parse the page source with bs4 (naming the parser avoids a bs4 warning)
html = bs.BeautifulSoup(response, "html.parser")
# Find the post metadata: class "article-meta-value"
meta_list = html.find_all("span", "article-meta-value")
# Split the string to get the author id and nickname
author = meta_list[0].text
# Author id
new_author = author.split("(")[0].strip()
# Nickname
idx1 = author.find("(") + 1
idx2 = author.find(")")
nickname = author[idx1: idx2]
# Board
board = meta_list[1].text
# Title: everything after the first "]"
title = meta_list[2].text
new_title = title.split("]", 1)[-1].strip()
# Category, taken from the "[...]" part of the title
idx1 = title.find("[") + 1
idx2 = title.find("]")
category = title[idx1: idx2]
# Time
time = meta_list[3].text

print("Author:", new_author)
print("Nickname:", nickname)
print("Board:", board)
print("Category:", category)
print("Title:", new_title)
print("Time:", time)
print("-"*30)

# Build the content of the JSON file
data = {
    "id": new_author,
    "nick name": nickname,
    "board name": board,
    "category": category,
    "title": new_title,
    "time": time,
    "push content": []
}

# Find the comments (pushes): class "push"
push_list = html.find_all("div", {"class": "push"})
for p in push_list:
  # 推 (upvote) / → (neutral) / 噓 (downvote)
  tag = p.find("span", "push-tag").text
  # Commenter id (named user_id so the built-in id() is not shadowed)
  user_id = p.find("span", "push-userid").text
  # Comment text: strip only the leading ": ", not every ": " inside the comment
  content = p.find("span", "push-content").text.lstrip(": ").strip()
  # IP & time
  ipdate = p.find("span", "push-ipdatetime").text.strip()
  print(tag, user_id, content, ipdate)

  # Build the JSON data for this comment
  push_data = {
      "type": tag,
      "id": user_id,
      "content": content,
      "IP & date": ipdate
  }
  data["push content"].append(push_data)

# Save the file
# Earlier we used f.write; json.dump now handles the serialization and the writing
with open(fpath, "w", encoding="utf-8") as f:
  json.dump(data, f, ensure_ascii=False, indent=4)

Author: JANUARZ
Nickname: 社會職人
Board: Beauty
Category: 正妹
Title: 有些角度像楊謹華
Time: Mon Jan 13 09:58:29 2025
------------------------------
推  ninaman 正 101.12.146.84 01/13 10:05
推  deltarobot . 49.217.122.9 01/13 10:07
推  elfindor 優 223.137.175.186 01/13 10:16
推  wglhe 優版派克 42.77.77.102 01/13 10:17
推  Uncontinue 正正 122.118.35.37 01/13 10:38
推  Williamtsou 門 27.53.230.154 01/13 10:52
推  aass5566 就是本人 42.77.55.73 01/13 11:10
→  bingreen 正 111.184.234.172 01/13 11:17
推  openbook13 優質皮朋 49.215.58.105 01/13 11:29
推  okbon 不錯喔 118.231.152.241 01/13 13:03
噓  TopGun2 明明就是 張鳳書 114.33.106.198 01/13 13:12
→  wl00669773 之前現場聊天過，一直覺得本人比照片 223.141.125.149 01/13 13:26
→  wl00669773 漂亮 223.141.125.149 01/13 13:27
噓  hmt17 許光漢？ 49.216.52.98 01/13 14:26
推  WasJohnWall 派克？ 61.228.67.71 01/13 15:32
推  durian0308 讚 42.78.236.55 01/13 15:51
推  saw6904 楊謹華天花版 42.79.150.39 01/13 18:24
推  ruffryders 正翻 42.78.17.7 01/13 19:57
推  a3300689 已追蹤 推藏頭 1.171.153.189 01/14 00:32
→  a9564208 比較像 常威 101.9.96.122 01/14 07:33
推  clkdtm32 比楊謹華正太多了吧 1
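
To see what `ensure_ascii=False` changes, and to check that the saved file can be read back, here is a small round-trip sketch. The dictionary is a cut-down sample built from the output above, and the file name `post.json` is arbitrary.

```python
import json
import os
import tempfile

# A cut-down sample in the same shape as the data saved above
data = {
    "id": "JANUARZ",
    "title": "有些角度像楊謹華",
    "push content": [{"type": "推 ", "id": "ninaman", "content": "正"}],
}

fpath = os.path.join(tempfile.mkdtemp(), "post.json")
with open(fpath, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

# json.load turns the file back into Python dicts and lists
with open(fpath, "r", encoding="utf-8") as f:
    loaded = json.load(f)

# Default (ensure_ascii=True) writes \uXXXX escapes; False keeps the characters readable
escaped = json.dumps(data["title"])
readable = json.dumps(data["title"], ensure_ascii=False)

print(loaded == data)
print(escaped)
print(readable)
```

Both forms load back to the same string; `ensure_ascii=False` only makes the file easier for a person to read.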



``` Python
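Background story: a rough idea is enough

We said earlier that import always means importing some .py file,
and that you then call it as module.function()

But until now we had never seen this shape:

os(module).xxxx.function

os.py contains these two lines (only the one matching your operating system runs):

import posixpath as path    (macOS / Linux)
import ntpath as path       (Windows)

That path is the path in os.path
!!! Users of the module do not want to tell the operating systems apart themselves
!!! Encapsulation: from the outside it always looks like the same exists
!!! Inside, the call is routed to the implementation for your operating system
!!! From the outside you always write the same line: os.path.exists


Both ntpath and posixpath contain this line:
from genericpath import *

So what you actually get is the exists defined in genericpath.py

!!! exists behaves the same on Windows / macOS / Linux
!!! We never want to write the same code repeatedly, so it is defined once in a shared file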
```
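
You can check which module sits behind `os.path` on your own machine. This is a small sketch using only standard attributes; which module `exists` reports can vary with the operating system and Python version, so no particular value is assumed for that line.

```python
import os
import sys

# "posixpath" on macOS/Linux, "ntpath" on Windows
module_name = os.path.__name__
print(module_name)

# os.path is the very same module object, just under another name
print(os.path is sys.modules[module_name])

# Where the exists function you are calling was defined
print(os.path.exists.__module__)

# The call itself looks identical on every operating system
print(os.path.exists(os.getcwd()))
```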



In [None]:
# 01/20: added folder creation
import os
import urllib.request as req
import bs4 as bs

url = "https://www.ptt.cc/bbs/Beauty/M.1736755829.A.02A.html"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"
# Create a folder named after the last part of the URL
dirname = url.split("/")[-1]
if not os.path.exists(dirname):
    os.makedirs(dirname)

r = req.Request(url)
r.add_header("User-Agent", user_agent)

response = req.urlopen(r)
# Naming the parser avoids a bs4 warning
html = bs.BeautifulSoup(response, "html.parser")

allow_subname = ["jpg", "jpeg", "png", "gif"]
links = html.find_all("a")
for l in links:
    # .get returns None instead of raising KeyError when an <a> has no href
    href = l.get("href")
    if href is None:
        continue
    subname = href.split(".")[-1]
    if subname.lower() in allow_subname:
        r = req.Request(href)
        r.add_header("User-Agent", user_agent)
        img = req.urlopen(r)
        fname = href.split("/")[-1]
        # Full path: os.path.join inserts the separator, so no manual "/" concatenation
        fpath = os.path.join(dirname, fname)
        with open(fpath, "wb") as f:
            f.write(img.read())
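
The extension check with `split(".")` breaks when an image link carries a query string such as `?size=large`. One possible alternative, sketched here with the hypothetical helper `image_filename` and made-up example links, is to parse the URL first and let `os.path.splitext` find the extension.

```python
import os
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def image_filename(href):
    """Return the file name if href points to an allowed image type, else None."""
    path = urlparse(href).path            # drops any "?query" and "#fragment"
    fname = os.path.basename(path)        # last part of the path
    ext = os.path.splitext(fname)[1].lower()
    return fname if ext in ALLOWED_EXTENSIONS else None


# Made-up links for illustration
print(image_filename("https://i.imgur.com/abc123.JPG?size=large"))
print(image_filename("https://www.ptt.cc/bbs/Beauty/index.html"))

# os.path.join picks the right separator for the current operating system
print(os.path.join("M.1736755829.A.02A.html", "abc123.jpg"))
```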


In [20]:
import os
import urllib.request as req
import bs4 as bs

def dl_post_image(url, folder_name):
    # Example url: "https://www.ptt.cc/bbs/Beauty/M.1736755829.A.02A.html"
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"
    # Create one folder per post inside folder_name
    dirname = url.split("/")[-1]
    dirname = os.path.join(folder_name, dirname)
    if not os.path.exists(dirname):
        os.makedirs(dirname)

    r = req.Request(url)
    r.add_header("User-Agent", user_agent)

    response = req.urlopen(r)
    # Naming the parser avoids a bs4 warning
    html = bs.BeautifulSoup(response, "html.parser")

    allow_subname = ["jpg", "jpeg", "png", "gif"]
    links = html.find_all("a")
    for l in links:
        # .get returns None instead of raising KeyError when an <a> has no href
        href = l.get("href")
        if href is None:
            continue
        subname = href.split(".")[-1]
        if subname.lower() in allow_subname:
            r = req.Request(href)
            r.add_header("User-Agent", user_agent)
            # Give up on an image if the server does not answer within 30 seconds
            img = req.urlopen(r, timeout=30)
            fname = href.split("/")[-1]
            # Full path: os.path.join inserts the separator, so no manual "/" concatenation
            fpath = os.path.join(dirname, fname)
            with open(fpath, "wb") as f:
                f.write(img.read())

# Exercise 3 (to be reviewed next session)


``` Python
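url = "https://www.ptt.cc/bbs/Beauty/index.html"
Download the images of every post on this index page, and put each post's images in its own folder
!!! The links on the page are abbreviated (relative, without the domain)
!!! Some posts have been deleted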
```
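
For the abbreviated links, one option is `urllib.parse.urljoin`, which resolves a relative `href` against the page it came from. The `href` below is an example of the shape PTT's index pages use, taken from a post URL earlier in these notes.

```python
from urllib.parse import urljoin

index_url = "https://www.ptt.cc/bbs/Beauty/index.html"

# An href as it appears on the index page: no scheme, no domain
href = "/bbs/Beauty/M.1736755829.A.02A.html"
post_url = urljoin(index_url, href)
print(post_url)  # https://www.ptt.cc/bbs/Beauty/M.1736755829.A.02A.html

# A bare file name is resolved relative to the current folder of the page
print(urljoin(index_url, "index3950.html"))  # https://www.ptt.cc/bbs/Beauty/index3950.html
```

Plain string concatenation with `"https://www.ptt.cc"` gives the same result for hrefs that start with `/`; `urljoin` also handles the other relative forms.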



In [21]:
import os
import urllib.request as req
import bs4 as bs

url = "https://www.ptt.cc/bbs/Beauty/index3950.html"

r = req.Request(url)
# Add header information first so the request looks like it comes from a browser
r.add_header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36")
response = req.urlopen(r)
# Parse the page source with bs4 (naming the parser avoids a bs4 warning)
html = bs.BeautifulSoup(response, "html.parser")
# Find the post titles: class "title"
titles = html.find_all("div", {"class":"title"})

for t in titles:
  post_link = t.find("a")
  # t contains an <a>: a normal post
  if post_link is not None:
    # The href is relative, so prepend the domain
    post_url = "https://www.ptt.cc" + post_link["href"]
    print(post_url)
    # Download the post's images into the "ptt" folder (dl_post_image is defined in the cell above)
    dl_post_image(post_url, "ptt")
  # t contains no <a>: the post has been deleted
  else:
    print("This post has been deleted")


https://www.ptt.cc/bbs/Beauty/M.1737162050.A.EA1.html
https://www.ptt.cc/bbs/Beauty/M.1737164418.A.B9B.html
https://www.ptt.cc/bbs/Beauty/M.1737164759.A.625.html
This post has been deleted
https://www.ptt.cc/bbs/Beauty/M.1737172640.A.4C5.html
https://www.ptt.cc/bbs/Beauty/M.1737177354.A.3CF.html
https://www.ptt.cc/bbs/Beauty/M.1737179262.A.1E8.html
https://www.ptt.cc/bbs/Beauty/M.1737182986.A.920.html
https://www.ptt.cc/bbs/Beauty/M.1737189398.A.CAA.html
https://www.ptt.cc/bbs/Beauty/M.1737192599.A.90F.html
https://www.ptt.cc/bbs/Beauty/M.1737193583.A.A22.html
https://www.ptt.cc/bbs/Beauty/M.1737209212.A.D66.html
https://www.ptt.cc/bbs/Beauty/M.1737248469.A.BC8.html
https://www.ptt.cc/bbs/Beauty/M.1737248570.A.E9D.html
https://www.ptt.cc/bbs/Beauty/M.1737248677.A.D05.html
https://www.ptt.cc/bbs/Beauty/M.1737248796.A.252.html
https://www.ptt.cc/bbs/Beauty/M.1737251853.A.27E.html
https://www.ptt.cc/bbs/Beauty/M.1737256223.A.C91.html
https://www.ptt.cc/bbs/Beauty/M.1737258056.A.744.html


URLError: <urlopen error timed out>
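
The `URLError` above stops the whole loop as soon as one download times out. One way to keep going is to retry a few times and then skip the item. The sketch below is an assumption about how that could look, not part of the original exercise: `fetch_bytes` and `fetch_with_retry` are hypothetical helpers, and the demo uses a fake fetch function so it runs without network access.

```python
import time
import urllib.error
import urllib.request as req


def fetch_bytes(url, timeout=30):
    """Download url and return its body as bytes (performs a real request)."""
    with req.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(url, fetch=fetch_bytes, retries=3, wait_seconds=1.0):
    """Call fetch(url) up to `retries` times; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError as e:
            # URLError and socket timeouts are both subclasses of OSError
            print(f"attempt {attempt} failed for {url}: {e}")
            if attempt < retries:
                time.sleep(wait_seconds)
    return None


# Demo with a fake fetch that fails twice, then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise urllib.error.URLError("timed out")
    return b"image bytes"

result = fetch_with_retry("https://example.com/a.jpg", fetch=flaky_fetch, wait_seconds=0)
print(result)
```

Inside a download loop, a `None` result would mean "skip this image and continue with the next one".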

In [19]:
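# The Colab file browser cannot delete a whole folder; use one of these two approaches instead
# (1) Delete it from the command line
# ! rm -rf <folder_name>
# ! rm -rf ptt
# (2) Delete items one by one with os.remove & os.rmdir
# import os

# topdown=False visits the deepest folders first, so each folder is empty before os.rmdir reaches it
# for p in os.walk("ptt", topdown=False):
#   for fname in p[2]:
#     fpath = os.path.join(p[0], fname)
#     os.remove(fpath)
#   for dirname in p[1]:
#     dpath = os.path.join(p[0], dirname)
#     os.rmdir(dpath)

# os.rmdir("ptt")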
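
The one-by-one approach only works if every folder is already empty when `os.rmdir` reaches it, because `os.rmdir` refuses to remove a non-empty folder. The sketch below builds a throwaway tree in a temporary folder and removes it bottom-up; it also shows `shutil.rmtree`, a standard-library shortcut that was not covered in the session. The helper names `make_tree` and `remove_tree_bottom_up` are hypothetical.

```python
import os
import shutil
import tempfile


def make_tree(root):
    """Create root/post1/a.jpg as a small stand-in for the downloaded images."""
    os.makedirs(os.path.join(root, "post1"))
    with open(os.path.join(root, "post1", "a.jpg"), "wb") as f:
        f.write(b"x")


def remove_tree_bottom_up(root):
    """Delete files, then the (now empty) folders, deepest level first."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for fname in filenames:
            os.remove(os.path.join(dirpath, fname))
        for dirname in dirnames:
            os.rmdir(os.path.join(dirpath, dirname))
    os.rmdir(root)


base = tempfile.mkdtemp()

root1 = os.path.join(base, "ptt1")
make_tree(root1)
remove_tree_bottom_up(root1)

root2 = os.path.join(base, "ptt2")
make_tree(root2)
shutil.rmtree(root2)  # removes the folder and everything inside it

print(os.path.exists(root1), os.path.exists(root2))  # False False
```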