# Web Scraper

### Website: [BookCool](https://www.bookscool.com/)

### **Copyright Notice**

![](https://i.imgur.com/pUugNmb.png)
![](https://i.imgur.com/LsqLzyz.png)

## **Disclaimer**
Please read the terms carefully before running the code below.

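> **Do not redistribute this dataset or use it commercially. It is for academic use only.**
> 
> **Respecting copyright is everyone's responsibility.**
>
> **Download only works that are more than 50 years old.**
>
> **Run the scraper only once and save the results locally, so the site does not think we are attacking it.**
>
> To save locally: Files panel on the left > right-click > Download (if Chrome blocks multiple downloads, set the prompt at the top right to Allow)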

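## **Usage**
1. Download one book: download_one_book (Block 5)
> Open [BookCool](https://www.bookscool.com/) > pick an author (or series) > pick a book > open the book > copy the book URL, paste it in, and run

2. Download a whole series: download_many_books (Block 7)
> Open [BookCool](https://www.bookscool.com/) > pick an author (or series) > copy the author (or series) URL, paste it in, and run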

In [1]:
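# Before running, remind yourself of the relevant copyright law and pledge to use the dataset for teaching purposes only.
# Download only books whose copyright term has expired (50 years).
# Running this cell means you have read the copyright rules and agree to follow them.

import os
import re
import urllib.parse  # import the submodule explicitly; a bare `import urllib` does not guarantee it is loaded

import requests
from lxml import etree
from tqdm import tqdm

# Create the output directory if it does not exist yet
os.makedirs("./books", exist_ok=True)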

In [2]:
def get_book_information(url):
  # Decode percent-escapes so the file name uses the readable book title
  url_chinese = urllib.parse.unquote(url)
  url_re = re.compile(r'com/(.*)\.php')    # book title between "com/" and ".php"
  url_re2 = re.compile(r'.*\.php')         # book base URL, up to and including ".php"
  url_re3 = re.compile(r'\.(html|xhtml)')  # page extension
  save_path = "./books/" + url_re.search(url_chinese).group(1).strip() + ".txt"
  url_book_domain = url_re2.match(url).group(0)
  html_format = url_re3.search(url).group(0)
  print("儲存位置: " + save_path)  # prints the save location
  return save_path, url_book_domain, html_format
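
To make the three return values concrete, here is a small stand-alone sketch that repeats the same parsing on the sample URL used later in this notebook. The name `parse_book_url` is hypothetical; it only mirrors `get_book_information` without the `print` call, so it can be run without touching the network.

```python
import re
import urllib.parse


def parse_book_url(url):
    """Hypothetical stand-in for get_book_information (no printing)."""
    # Decode percent-escapes so the title is readable
    readable = urllib.parse.unquote(url)
    title = re.search(r'com/(.*)\.php', readable).group(1).strip()
    save_path = "./books/" + title + ".txt"
    # Everything up to and including ".php" is the book's base URL
    base = re.match(r'.*\.php', url).group(0)
    # The page extension decides how pages are numbered and parsed later
    ext = re.search(r'\.(html|xhtml)', url).group(0)
    return save_path, base, ext


sample = "https://www.bookscool.com/%E3%80%8A%E5%82%B3%E5%A5%87%E3%80%8B.php/3.xhtml"
save_path, base, ext = parse_book_url(sample)
print(save_path)
print(base)
print(ext)
```

Note that the save path is built from the decoded URL, while the base URL keeps its percent-escapes so it can be requested as-is.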

In [3]:
def get_pages_of_book(url):
  # The number of <li> entries in the table of contents is used as the page count
  response = requests.get(url + "#book_toc")
  html = etree.HTML(response.content)
  content_number = len(html.xpath('.//div[@data-role="content"]//ul/li'))
  return content_number

In [4]:
def download_one_book(url):
  save_path, url_book_domain, html_format = get_book_information(url)
  content_number = get_pages_of_book(url)
  # .html books start at page 02 with zero-padded names; .xhtml books start at page 1
  page_start = 2 if html_format == ".html" else 1
  page_end = page_start + content_number
  # The context manager closes the file even if a request fails midway
  with open(save_path, "w", encoding="utf8") as file:
    for page in tqdm(range(page_start, page_end)):
      if html_format == ".html":
        url_ = url_book_domain + "/" + str(page).zfill(2) + html_format
      else:
        url_ = url_book_domain + "/" + str(page) + html_format
      response = requests.get(url_)
      html = etree.HTML(response.content.decode("utf-8", "replace"))
      # The text is read from <p> children on .html pages and <div> children on .xhtml pages
      if html_format == ".html":
        content = html.xpath('.//div[@data-role="content"]/p/text()')
      else:
        content = html.xpath('.//div[@data-role="content"]/div/text()')
      # Trim each line and replace full-width spaces (U+3000) with ASCII spaces
      lines = [a.strip().replace("\u3000", " ") for a in content]
      file.write("\n".join(lines) + "\n")
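
The page numbering in `download_one_book` is easy to misread, so the sketch below isolates it. `build_page_urls` is a hypothetical helper, and `https://example.com/book.php` is a placeholder base URL, not a real book; the numbering rules are copied from the function above (`.html` pages start at `02` and are zero-padded, `.xhtml` pages start at `1`).

```python
def build_page_urls(base_url, html_format, page_count):
    """Hypothetical helper mirroring the numbering in download_one_book."""
    start = 2 if html_format == ".html" else 1
    urls = []
    for page in range(start, start + page_count):
        # .html pages use two-digit names (02, 03, ...); .xhtml pages do not
        name = str(page).zfill(2) if html_format == ".html" else str(page)
        urls.append(base_url + "/" + name + html_format)
    return urls


html_urls = build_page_urls("https://example.com/book.php", ".html", 3)
xhtml_urls = build_page_urls("https://example.com/book.php", ".xhtml", 2)
print(html_urls)
print(xhtml_urls)
```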

> Open [BookCool](https://www.bookscool.com/) > pick an author (or series) > pick a book > open the book > copy the book URL, paste it in, and run

In [None]:
url = "https://www.bookscool.com/%E3%80%8A%E5%82%B3%E5%A5%87%E3%80%8B.php/3.xhtml"
download_one_book(url)

儲存位置: ./books/《傳奇》.txt


100%|██████████| 19/19 [00:15<00:00,  1.19it/s]


In [5]:
def download_many_books(url):
  response = requests.get(url)
  html = etree.HTML(response.content)
  links = html.xpath('.//div[@data-role="content"]/a/@href')
  # Keep only relative links; drop absolute links that start with "http"
  links = [link for link in links if not link.startswith("http")]
  for href in links:
    link = "https://www.bookscool.com/" + href
    download_one_book(link)
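
The link filtering step can be tried without any network access. The sketch below is an alternative formulation, not the notebook's own code: it uses `urllib.parse.urljoin` instead of string concatenation, which also copes with hrefs that begin with `/`. The helper name `to_book_urls` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bookscool.com/"


def to_book_urls(hrefs, base_url=BASE_URL):
    """Hypothetical helper: keep relative hrefs and make them absolute."""
    return [urljoin(base_url, h) for h in hrefs if not h.startswith("http")]


# Made-up hrefs: one absolute link to skip, two relative book links to keep
hrefs = ["https://ads.example/banner", "book1.php/01.html", "book2.php/1.xhtml"]
book_urls = to_book_urls(hrefs)
print(book_urls)
```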

> Open [BookCool](https://www.bookscool.com/) > pick an author (or series) > copy the author (or series) URL, paste it in, and run

In [6]:
url = "https://www.bookscool.com/harrypotter"
download_many_books(url)

儲存位置: ./books/哈利波特1神秘的魔法石.txt


100%|██████████| 17/17 [00:12<00:00,  1.40it/s]


儲存位置: ./books/哈利波特2消失的密室.txt


100%|██████████| 18/18 [00:13<00:00,  1.29it/s]


儲存位置: ./books/哈利波特3阿茲卡班的逃犯.txt


100%|██████████| 22/22 [00:17<00:00,  1.27it/s]


儲存位置: ./books/哈利波特4火盃的考驗.txt


100%|██████████| 37/37 [00:28<00:00,  1.31it/s]


儲存位置: ./books/哈利波特5鳳凰會的密令.txt


100%|██████████| 38/38 [00:31<00:00,  1.21it/s]


儲存位置: ./books/哈利波特6混血王子的背叛.txt


100%|██████████| 30/30 [00:24<00:00,  1.25it/s]


儲存位置: ./books/哈利波特7死神的聖物.txt


100%|██████████| 36/36 [00:27<00:00,  1.29it/s]
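
The disclaimer asks readers to run the scraper once and keep the files locally. One possible way to make an accidental re-run harmless is to check for an existing local copy before requesting anything. The sketch below assumes a hypothetical helper, `needs_download`, which is not part of the notebook, and demonstrates it in a temporary directory rather than `./books`.

```python
import os
import tempfile


def needs_download(save_path):
    """Return True when no non-empty local copy exists yet."""
    return not (os.path.isfile(save_path) and os.path.getsize(save_path) > 0)


# Demonstration in a temporary directory so nothing in ./books is touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "book.txt")
    before = needs_download(path)   # no file yet
    with open(path, "w", encoding="utf8") as f:
        f.write("chapter text\n")
    after = needs_download(path)    # a non-empty file now exists

print(before, after)
```

A caller could compute the save path first and invoke `download_one_book` only when `needs_download` returns `True`.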
